Abstract

Singular Value Decomposition (and Principal Component Analysis) is one of the most widely used techniques 
for dimensionality reduction: successful and efficiently computable, it is nevertheless plagued by a well-known, 
well-documented sensitivity to outliers. Recent work has considered the setting where each point has a few arbitrarily 
corrupted components. Yet, in applications of SVD or PCA such as robust collaborative filtering or bioinformatics, 
malicious agents, defective genes, or simply corrupted or contaminated experiments may effectively yield entire 
points that are completely corrupted. 

We present an efficient convex optimization-based algorithm, which we call Outlier Pursuit, that under some mild
assumptions on the uncorrupted points (satisfied, e.g., by the standard generative assumption in PCA problems)
recovers the exact optimal low-dimensional subspace and identifies the corrupted points. Such identification of
corrupted points that do not conform to the low-dimensional approximation is of paramount interest in
bioinformatics and financial applications, and beyond. Our techniques involve matrix decomposition using nuclear norm
minimization; however, our results, setup, and approach necessarily differ considerably from the existing line of
work in matrix completion and matrix decomposition, since we develop an approach to recover the correct column
space of the uncorrupted matrix, rather than the exact matrix itself. In any problem where one seeks to recover a
structure rather than the exact initial matrices, techniques developed thus far relying on certificates of optimality
will fail. We present an important extension of these methods that allows the treatment of such problems.
I. Introduction



This paper is about the following problem: suppose we are given a large data matrix $M$, and we know
it can be decomposed as

$$M = L_0 + C_0,$$

where $L_0$ is a low-rank matrix, and $C_0$ is non-zero in only a fraction of the columns. Aside from these
broad restrictions, both components are arbitrary. In particular we do not know the rank (or the row/column
space) of $L_0$, or the number and positions of the non-zero columns of $C_0$. Can we recover the column space
of the low-rank matrix $L_0$, and the identities of the non-zero columns of $C_0$, exactly and efficiently?

We are primarily motivated by Principal Component Analysis (PCA), arguably the most widely used
technique for dimensionality reduction in statistical data analysis. The canonical PCA problem [2] seeks to
find the best (in the least-square-error sense) low-dimensional subspace approximation to high-dimensional
points. Using the Singular Value Decomposition (SVD), PCA finds the lower-dimensional approximating
subspace by forming a low-rank approximation to the data matrix, formed by considering each point as
a column; the output of PCA is the (low-dimensional) column space of this low-rank approximation.

It is well known (e.g., [3]-[6]) that standard PCA is extremely fragile to the presence of outliers: even 
a single corrupted point can arbitrarily alter the quality of the approximation. Such non-probabilistic or 
persistent data corruption may stem from sensor failures, malicious tampering, or the simple fact that 
some of the available data may not conform to the presumed low-dimensional source / model. In terms 
of the data matrix, this means that most of the column vectors will lie in a low-dimensional space (and
hence the corresponding matrix $L_0$ will be low-rank) while the remaining columns will be outliers,
corresponding to the column-sparse matrix $C_0$. The natural question in this setting is to ask if we can
still (exactly or near-exactly) recover the column space of the uncorrupted points, and the identities of 
the outliers. This is precisely our problem. 
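
The fragility described above is easy to reproduce numerically. The following is a minimal sketch, assuming NumPy and an uncentered SVD-based PCA; the data (50 points on a line plus one distant point) and the helper name `top_direction` are illustrative choices, not taken from the paper.

```python
import numpy as np

def top_direction(M):
    """Return the leading left singular vector of M (first principal direction)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0]

# 50 clean points on the x-axis of R^2, stored as columns of a 2 x 50 matrix.
t = np.linspace(-1.0, 1.0, 50)
clean = np.vstack([t, np.zeros_like(t)])

# Append a single grossly corrupted point far along the y-axis.
outlier = np.array([[0.0], [100.0]])
corrupted = np.hstack([clean, outlier])

u_clean = top_direction(clean)          # aligned with the x-axis (up to sign)
u_corrupted = top_direction(corrupted)  # aligned with the y-axis (up to sign)
```

One corrupted column out of 51 is enough to rotate the recovered direction by 90 degrees in this toy instance.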
Our results: We consider a novel but natural convex optimization approach to the recovery problem
above. The main result of this paper is to establish that, under certain natural conditions, the optimum of
this convex program will yield the column space of $L_0$ and the identities of the outliers (i.e., the non-zero
columns of $C_0$). Our conditions depend on the fraction of points that are outliers (which can otherwise be
completely arbitrary), and incoherence of the row space of $L_0$. The latter condition essentially requires that
each direction in the column space of $L_0$ be represented in a sufficient number of non-outlier points; we
discuss this in more detail below. We note that our results do not require incoherence of the column space, as
is done, e.g., in the papers [5], [6]. This is due to our alternative convex formulation, and our analytical
approach that focuses only on recovery of the column space, instead of "exact recovery" of the entire
$L_0$ matrix. This also means our method's performance is rotation invariant: in particular, applying the
same rotation to all given points (i.e., columns) will not change its performance. This is again not true
for the method in [5], [6]. Finally, we extend our analysis to the noisy case when all points, outliers or
otherwise, are additionally corrupted by noise.

Related Work 

Robust PCA has a long history (e.g., [4], [7]-[13]). Each of these algorithms either performs standard
PCA on a robust estimate of the covariance matrix, or finds directions that maximize a robust estimate of 
the variance of the projected data. These algorithms seek to approximately recover the column space, and 
moreover, no existing approach attempts to identify the set of outliers. This outlier identification, while 
outside the scope of traditional PCA algorithms, is important in a variety of applications such as finance, 
bio-informatics, and more. 

Many existing robust PCA algorithms suffer two pitfalls: performance degradation with dimension 
increase, and computational intractability. To wit, [14] shows that several robust PCA algorithms includ- 
ing M-estimator [15], Convex Peeling [16], Ellipsoidal Peeling [17], Classical Outlier Rejection [18], 
Iterative Deletion [19] and Iterative Trimming [20] have breakdown points proportional to the inverse of 
dimensionality, and hence are useless in the high dimensional regime we consider. 

Algorithms with non-diminishing breakdown point, such as Projection-Pursuit [21] are non-convex or 
even combinatorial, and hence computationally intractable (NP-hard) as the size of the problem scales. In 
contrast to these, the performance of Outlier Pursuit does not depend on the dimension, p, and its running 
time scales gracefully in problem size (in particular, it can be solved in polynomial time). 

Algorithms based on nuclear norm minimization to recover low rank matrices are now standard, since 
the seminal paper [22]. Recent work [5], [6] has taken the nuclear norm minimization approach to the 
decomposition of a low-rank matrix and an overall sparse matrix. At a high level, these papers are close 
in spirit to ours. However, there are critical differences in the problem setup, and the results; for one 
thing, the algorithms introduced there fail in our setting, as they cannot handle outliers — entire columns 
where every entry is corrupted. Beyond this, our approach differs in key analysis techniques, which we 
believe will prove much more broadly applicable and thus of general interest. 

In particular, our work requires a significant extension of existing techniques for matrix decomposition, 
precisely because the goal is to recover the column space of $L_0$ (the principal components, in PCA), as
opposed to the exact matrices. Indeed, the above works investigate exact signal recovery: the intended
outcome is known ahead of time, and one just needs to investigate the conditions needed for success. In
our setting, however, the convex optimization cannot recover $L_0$ itself exactly. We introduce the use of an
oracle problem, defined by the structure we seek to recover (here, the true column space). This enables 
us to show that our convex optimization-based algorithm recovers the correct (or nearly correct, in the 
presence of noise) column space, as well as the identity of the corrupted points, or outliers. 

We believe that this line of analysis will prove to be much more broadly applicable. Often,
exact recovery simply does not make sense under strong corruption models (such as complete column
corruption) and the best one can hope for is to capture, exactly or approximately, some structural aspect
of the problem. In such settings, it may be impossible to follow the proof recipes laid out in works such
as [5], [6], [22], [23], that essentially obtain exact recovery from their convex optimization formulations.
Thus, in addition to our algorithm and our results, we consider the particular proof technique a contribution
of potentially general interest.



II. Problem Setup 

The precise PCA with outlier problem that we consider is as follows: we are given $n$ points in
$p$-dimensional space. A fraction $1 - \gamma$ of the points lie on an $r$-dimensional true subspace of the ambient
$\mathbb{R}^p$, while the remaining $\gamma n$ points are arbitrarily located; we call these outliers/corrupted points. We
do not have any prior information about the true subspace or its dimension $r$. Given the set of points, we
would like to learn (a) the true subspace and (b) the identities of the outliers.

As is common practice, we collate the points into a $p \times n$ data matrix $M$, each of whose columns is
one of the points, and each of whose rows is one of the $p$ coordinates. It is then clear that the data matrix
can be decomposed as

$$M = L_0 + C_0.$$

Here $C_0$ is the column-sparse matrix ($(1 - \gamma) n$ columns are zero) corresponding to the outliers, and
$L_0$ is the matrix corresponding to the non-outliers. Thus, $\mathrm{rank}(L_0) = r$, and we assume its columns
corresponding to non-zero columns of $C_0$ are identically zero (whatever those columns were cannot
possibly be recovered). Consider its Singular Value Decomposition (SVD)

$$L_0 = U_0 \Sigma_0 V_0^\top. \qquad (1)$$

The columns of $U_0$ form an orthonormal basis for the $r$-dimensional subspace we wish to recover. $C_0$ is
the matrix corresponding to the outliers; we will denote the set of non-zero columns of $C_0$ by $\mathcal{I}_0$, with
$|\mathcal{I}_0| = \gamma n$. These non-zero columns are completely arbitrary.
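
To make the setup concrete, here is a small sketch that generates a synthetic instance of the decomposition above. It assumes NumPy; the function name `make_instance`, the Gaussian factors, and the Gaussian outliers are illustrative assumptions (the problem statement allows the outliers to be completely arbitrary).

```python
import numpy as np

def make_instance(p, n, r, gamma, seed=0):
    """Build M = L0 + C0 with rank-r L0 and gamma*n non-zero columns in C0."""
    rng = np.random.default_rng(seed)
    n_out = int(round(gamma * n))
    outliers = np.sort(rng.choice(n, size=n_out, replace=False))

    # Low-rank part: product of Gaussian factors, zeroed on the outlier columns.
    L0 = rng.standard_normal((p, r)) @ rng.standard_normal((r, n))
    L0[:, outliers] = 0.0

    # Column-sparse part: arbitrary (here Gaussian) columns on the outlier set.
    C0 = np.zeros((p, n))
    C0[:, outliers] = rng.standard_normal((p, n_out))
    return L0 + C0, L0, C0, outliers

M, L0, C0, outliers = make_instance(p=20, n=50, r=3, gamma=0.2)
```

Here `outliers` plays the role of the index set of non-zero columns of the corruption matrix, and the columns of the low-rank part on that set are identically zero, as assumed in the text.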

With this notation, our intent is to exactly recover the column space of $L_0$, and the set of outliers $\mathcal{I}_0$.
All we are given is the matrix $M$. Clearly, exact recovery is not always going to be possible (regardless
of the algorithm used) and thus we need to impose a few weak additional assumptions. We develop these
in Section II-A below.

We are also interested in the noisy case, where 

$$M = L_0 + C_0 + N,$$

and N corresponds to any additional noise. In this case we are interested in approximate identification of 
both the true subspace and the outliers. 



A. Incoherence: When can the column space be recovered?

In general, our objective of recovering the "true" column space of a low-rank matrix that is corrupted
with a column-sparse matrix is not always well defined. As an extreme example, consider the case
where the data matrix $M$ is non-zero in only one column. Such a matrix is both low-rank and
column-sparse, thus the problem is unidentifiable. To make the problem meaningful, we need to impose that the
low-rank matrix $L_0$ cannot itself be column-sparse as well. This is done via the following incoherence
condition.

Definition: A matrix $L \in \mathbb{R}^{p \times n}$ with SVD $L = U \Sigma V^\top$, and $(1 - \gamma) n$ of whose columns are non-zero,
is said to be column-incoherent with parameter $\mu$ if

$$\max_i \|V^\top e_i\|_2^2 \le \frac{\mu r}{(1 - \gamma) n},$$

where $\{e_i\}$ are the coordinate unit vectors.

Thus if $V$ has a column aligned with a coordinate axis, then $\mu = (1 - \gamma) n / r$. Similarly, if $V$ is perfectly
incoherent (e.g., if $r = 1$ and every non-zero entry of $V$ has magnitude $1/\sqrt{(1 - \gamma) n}$) then $\mu = 1$.
In the standard PCA setup, if the points are generated by some low-dimensional isometric (e.g.,
Gaussian) distribution, then with high probability, one will have $\mu = O(\max(1, \log(n)/r))$ [24]. Alternatively,
if the points are generated by a uniform distribution over a bounded set, then $\mu = O(1)$.
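
The incoherence parameter can be computed directly from the definition. The sketch below assumes NumPy; the function name `column_incoherence` and the tolerance `tol` used to decide the numerical rank and the non-zero columns are assumptions of this example. It returns the smallest $\mu$ for which the defining inequality holds.

```python
import numpy as np

def column_incoherence(L, tol=1e-10):
    """Smallest mu with max_i ||V^T e_i||^2 <= mu * r / m,
    where r is the rank of L and m is its number of non-zero columns."""
    _, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))           # numerical rank (L assumed non-zero)
    V = Vt[:r].T                              # n x r matrix of right singular vectors
    m = int(np.sum(np.linalg.norm(L, axis=0) > tol))
    return np.max(np.sum(V**2, axis=1)) * m / r

# Perfectly incoherent rank-one example: every entry of V has the same magnitude.
mu_flat = column_incoherence(np.ones((3, 8)))

# One right singular vector aligned with a coordinate axis: mu = m / r = 4 / 2.
mu_spiky = column_incoherence(np.array([[1.0, 0.0, 0.0, 0.0],
                                        [0.0, 1.0, 1.0, 1.0]]))
```

The two calls reproduce the two extreme cases discussed after the definition: $\mu = 1$ for a perfectly spread-out row space and $\mu = (1 - \gamma) n / r$ when a single observation defines a direction.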

A small incoherence parameter $\mu$ essentially enforces that the matrix $L_0$ will have column support that
is spread out. Note that this is quite natural from the application perspective. Indeed, if the left hand
side is as big as 1, it essentially means that one of the directions of the column space which we wish
to recover is defined by only a single observation. Given the regime of a constant fraction of arbitrarily
chosen and arbitrarily corrupted points, such a setting is not meaningful. Having a small incoherence $\mu$
is an assumption made in all methods based on nuclear norm minimization to date [5], [6], [24], [25].
Also unidentifiable is the setting where a corrupted point lies in the true subspace. Thus, in matrix terms,
we require that every column of $C_0$ does not lie in the column space of $L_0$.

We note that this condition is slightly different from the incoherence conditions required for matrix
completion in, e.g., [24]. In particular, matrix completion requires row-incoherence (a condition on $U$ of
the SVD) and joint-incoherence (a condition on the product $U V^\top$) in addition to the above condition. We
do not require these extra conditions because we have a more relaxed objective from our convex program:
namely, we only want to recover the column space.

The parameters $\mu$ and $\gamma$ are not required for the execution of the algorithm, and do not need to be
known a priori. They only arise in the analysis of our algorithm's performance.

Other Notation and Preliminaries: Capital letters such as $A$ are used to represent matrices, and
accordingly, $A_i$ denotes the $i$-th column vector. Letters $U$, $V$, $\mathcal{I}$ and their variants (complements, subscripts,
etc.) are reserved for column space, row space and column support respectively. There are four associated
projection operators we use throughout. The projection onto the column space, $U$, is denoted by $\mathcal{P}_U$
and given by $\mathcal{P}_U(A) = U U^\top A$, and similarly for the row space $\mathcal{P}_V(A) = A V V^\top$. The matrix $\mathcal{P}_{\mathcal{I}}(A)$ is
obtained from $A$ by setting column $A_i$ to zero for all $i \notin \mathcal{I}$. Finally, $\mathcal{P}_T$ is the projection to the space
spanned by $U$ and $V$, and given by $\mathcal{P}_T(\cdot) = \mathcal{P}_U(\cdot) + \mathcal{P}_V(\cdot) - \mathcal{P}_U \mathcal{P}_V(\cdot)$. Note that $\mathcal{P}_T$ depends on $U$ and
$V$, and we suppress this notation wherever it is clear which $U$ and $V$ we are using. The complementary
operators, $\mathcal{P}_{U^\perp}$, $\mathcal{P}_{V^\perp}$, $\mathcal{P}_{T^\perp}$ and $\mathcal{P}_{\mathcal{I}^c}$ are defined as usual. The same notation is also used to represent a
subspace of matrices: e.g., we write $A \in \mathcal{P}_U$ for any matrix $A$ that satisfies $\mathcal{P}_U(A) = A$. Five matrix
norms are used: $\|A\|_*$ is the nuclear norm, $\|A\|$ is the spectral norm, $\|A\|_{1,2}$ is the sum of the $\ell_2$ norms of
the columns $A_i$, $\|A\|_{\infty,2}$ is the largest $\ell_2$ norm of the columns, and $\|A\|_F$ is the Frobenius norm. The
only vector norm used is $\|\cdot\|_2$, the $\ell_2$ norm. Depending on the context, $I$ is either the unit matrix or the
identity operator; $e_i$ is the $i$-th standard basis vector. The SVD of $L_0$ is $U_0 \Sigma_0 V_0^\top$. We use $r$ to denote the
rank of $L_0$, and $\gamma = |\mathcal{I}_0|/n$ the fraction of outliers.
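
The norms and projection operators introduced above translate almost literally into NumPy. The following is a minimal sketch; the function names are hypothetical, and `U` and `V` are assumed to have orthonormal columns.

```python
import numpy as np

def nuclear_norm(A):
    return np.linalg.svd(A, compute_uv=False).sum()

def spectral_norm(A):
    return np.linalg.norm(A, 2)

def norm_1_2(A):
    """Sum of the l2 norms of the columns."""
    return np.linalg.norm(A, axis=0).sum()

def norm_inf_2(A):
    """Largest l2 norm of a column."""
    return np.linalg.norm(A, axis=0).max()

def P_U(A, U):
    """Projection onto matrices whose column space lies in span(U)."""
    return U @ (U.T @ A)

def P_V(A, V):
    """Projection onto matrices whose row space lies in span(V)."""
    return (A @ V) @ V.T

def P_I(A, idx):
    """Zero every column whose index is not in idx."""
    out = np.zeros_like(A)
    out[:, idx] = A[:, idx]
    return out

def P_T(A, U, V):
    return P_U(A, U) + P_V(A, V) - P_U(P_V(A, V), U)

A = np.array([[3.0, 0.0], [4.0, 0.0]])   # one non-zero column of length 5
```

For the rank-one matrix `A` above, all four norms equal 5, which gives a quick sanity check of the definitions.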

III. Main Results and Consequences 

While we do not recover the matrix $L_0$, we show that the goal of PCA can be attained: even under
our strong corruption model, with a constant fraction of points corrupted, we show that we can (under
mild assumptions) exactly recover both the column space of $L_0$ (i.e., the low-dimensional space the
uncorrupted points lie on) and the column support of $C_0$ (i.e., the identities of the outliers), from $M$. If
there is additional noise corrupting the data matrix, i.e., if we have $M = L_0 + C_0 + N$, a natural variant
of our approach finds a good approximation. In the absence of noise, an easy post-processing step is in
fact able to exactly recover the original matrix $L_0$. We emphasize, however, that the inability to do this
simply via the convex optimization step poses significant technical challenges, as we detail below.

A. Algorithm 

Given the data matrix $M$, our algorithm, called Outlier Pursuit, generates (a) a matrix $U^*$, with
orthonormal columns, that spans the low-dimensional true subspace we want to recover, and (b) a set of
column indices $\mathcal{I}^*$ corresponding to the outlier points.
Algorithm 1 Outlier Pursuit

Find $(L^*, C^*)$, the optimum of the following convex optimization program

$$\text{Minimize: } \|L\|_* + \lambda \|C\|_{1,2} \qquad \text{Subject to: } M = L + C \qquad (2)$$

Compute the SVD $L^* = U_1 \Sigma_1 V_1^\top$ and output $U^* = U_1$.

Output the set of non-zero columns of $C^*$, i.e. $\mathcal{I}^* = \{j : c^*_{ij} \neq 0 \text{ for some } i\}$.
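
The paper does not prescribe a particular solver for the convex program in Algorithm 1. As a hedged illustration, proximal-type solvers for objectives of this form are usually built from two proximal maps: singular value thresholding for the nuclear norm and column-wise shrinkage for the $\|\cdot\|_{1,2}$ norm. The sketch below, assuming NumPy, shows these two maps together with the last two steps of Algorithm 1; all function names and the tolerance `tol` are hypothetical, and this is not a complete solver.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal map of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def column_shrink(A, tau):
    """Column-wise shrinkage: proximal map of tau * ||.||_{1,2}."""
    norms = np.linalg.norm(A, axis=0)
    # Guard against division by zero; all-zero columns simply stay zero.
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-300), 0.0)
    return A * scale

def extract_outputs(L_star, C_star, tol=1e-8):
    """Steps 2 and 3 of Algorithm 1, with a numerical tolerance in place of
    exact zero tests: return a basis of the column space of L_star and the
    indices of the non-zero columns of C_star."""
    U, s, _ = np.linalg.svd(L_star, full_matrices=False)
    r = int(np.sum(s > tol))
    outlier_idx = np.flatnonzero(np.linalg.norm(C_star, axis=0) > tol)
    return U[:, :r], outlier_idx

# Tiny worked example: column norms 5 and 1, threshold 1.
shrunk = column_shrink(np.array([[3.0, 0.0], [4.0, 1.0]]), 1.0)
thresholded = svt(np.diag([3.0, 1.0]), 2.0)
```

In a numerical solution the entries of the corruption estimate are rarely exactly zero, which is why `extract_outputs` thresholds column norms rather than testing for exact zeros.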



While in the noiseless case there are simple algorithms with similar performance, the benefit of the 
algorithm, and of the analysis, is extension to more realistic and interesting situations where in addition 
to gross corruption of some samples, there is additional noise. Adapting the Outlier Pursuit algorithm, we 
have the following variant for the noisy case. 



Noisy Outlier Pursuit:

$$\text{Minimize: } \|L\|_* + \lambda \|C\|_{1,2} \qquad \text{Subject to: } \|M - (L + C)\|_F \le \varepsilon \qquad (3)$$

Outlier Pursuit (and its noisy variant) is a convex surrogate for the following natural (but combinatorial
and intractable) first approach to the recovery problem:

$$\text{Minimize: } \mathrm{rank}(L) + \lambda \|C\|_{0,c} \qquad \text{Subject to: } M = L + C \qquad (4)$$

where $\|\cdot\|_{0,c}$ stands for the number of non-zero columns of a matrix.
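
To see the relation between the combinatorial objective and its convex surrogate, both can be evaluated on the same feasible pair. This is a small sketch assuming NumPy; the function names and the tolerance `tol` are assumptions of the example.

```python
import numpy as np

def combinatorial_objective(L, C, lam, tol=1e-10):
    """rank(L) + lam * (number of non-zero columns of C)."""
    nonzero_cols = int(np.sum(np.linalg.norm(C, axis=0) > tol))
    return np.linalg.matrix_rank(L, tol=tol) + lam * nonzero_cols

def convex_objective(L, C, lam):
    """||L||_* + lam * ||C||_{1,2}, the Outlier Pursuit objective."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()
    return nuclear + lam * np.linalg.norm(C, axis=0).sum()

L = np.ones((2, 2))                       # rank 1, nuclear norm 2
C = np.array([[0.0, 3.0], [0.0, 4.0]])    # one non-zero column of norm 5

obj_comb = combinatorial_objective(L, C, lam=0.5)   # 1 + 0.5 * 1
obj_conv = convex_objective(L, C, lam=0.5)          # 2 + 0.5 * 5
```

The rank is replaced by the sum of singular values and the column count by the sum of column norms, so the surrogate is sensitive to scale while the combinatorial objective is not.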

B. Performance 

We show that under rather weak assumptions, Outlier Pursuit exactly recovers the column space of the
low-rank matrix $L_0$, and the identities of the non-zero columns of the outlier matrix $C_0$. The formal statement
appears below.

Theorem 1 (Noiseless Case): Suppose we observe $M = L_0 + C_0$, where $L_0$ has rank $r$ and incoherence
parameter $\mu$. Suppose further that $C_0$ is supported on at most $\gamma n$ columns. Any output to Outlier Pursuit
recovers the column space exactly, and identifies exactly the indices of columns corresponding to outliers
not lying in the recovered column space, as long as the fraction of corrupted points, $\gamma$, satisfies

$$\frac{\gamma}{1 - \gamma} \le \frac{c_1}{\mu r}, \qquad (5)$$

where $c_1 = \frac{9}{121}$. This can be achieved by setting the parameter $\lambda$ in the Outlier Pursuit algorithm to be
$\frac{3}{7 \sqrt{\gamma n}}$; in fact it holds for any $\lambda$ in a specific range which we provide below.

Note that we only need to know an upper bound on the number of outliers. This is because the success
of Outlier Pursuit is monotonic: if it can recover the column space of $L_0$ with a certain set of outliers, it
will also recover it when an arbitrary subset of these points are converted to non-outliers (i.e., they are
replaced by points in the column space of $L_0$).
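
The sufficient condition of Theorem 1 can be turned into a quick feasibility check. The sketch below uses only the Python standard library. The constant `c1` is passed in explicitly; the value $9/121$ and the setting $\lambda = 3/(7\sqrt{\gamma n})$ used in the example calls are assumptions of this illustration, as are the function names.

```python
import math

def max_outlier_fraction(mu, r, c1):
    """Largest gamma satisfying gamma / (1 - gamma) <= c1 / (mu * r)."""
    bound = c1 / (mu * r)
    return bound / (1.0 + bound)

def suggested_lambda(gamma, n):
    """Assumed parameter choice: lambda = 3 / (7 * sqrt(gamma * n))."""
    return 3.0 / (7.0 * math.sqrt(gamma * n))

# With mu = 1 and r = 1 the bound gives gamma <= (9/121) / (1 + 9/121) = 9/130.
gamma_max = max_outlier_fraction(mu=1.0, r=1, c1=9.0 / 121.0)
lam = suggested_lambda(gamma=0.25, n=100)   # 3 / (7 * 5)
```

Solving the inequality for $\gamma$ makes explicit that the tolerable outlier fraction shrinks as $\mu r$ grows.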

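For the case where, in addition to the corrupted points, we have noisy observations $\tilde{M} = M + N$, we
have the following result.

Theorem 2 (Noisy Case): Suppose we observe $\tilde{M} = M + N = L_0 + C_0 + N$, where

$$\frac{\gamma}{1 - \gamma} \le \frac{c_2}{\mu r}, \qquad (6)$$

with $c_2 = \frac{9}{1024}$, and $\|N\|_F \le \varepsilon$. Let the output of Noisy Outlier Pursuit be $L', C'$. Then there exists $\tilde{L}, \tilde{C}$
such that $M = \tilde{L} + \tilde{C}$, $\tilde{L}$ has the correct column space, and $\tilde{C}$ the correct column support, and

$$\|L' - \tilde{L}\|_F \le 10 \sqrt{n}\, \varepsilon; \qquad \|C' - \tilde{C}\|_F \le 9 \sqrt{n}\, \varepsilon.$$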
The conditions in this theorem are essentially tight in the following scaling sense (i.e., up to universal
constants). If there is no additional structure imposed beyond what we have stated above, then up to
scaling, in the noiseless case, Outlier Pursuit can recover from as many outliers (i.e., the same fraction)
as any algorithm of possibly arbitrary complexity. In particular, it is easy to see that if the rank of the
matrix $L_0$ is $r$, and the fraction of outliers satisfies $\gamma \ge 1/(r + 1)$, then the problem is not identifiable,
i.e., no algorithm can separate authentic and corrupted points. In the presence of stronger assumptions
(e.g., isometric distribution) on the authentic points, better recovery guarantees are possible [26].

IV. Proof of Theorem 1

In this section and the next section, we prove Theorem 1 and Theorem 2. Past matrix recovery papers,
including [5], [6], [24], sought exact recovery. As such, the generic (and successful) roadmap for the proof
technique was to identify the first-order necessary and sufficient conditions for a feasible solution to be
optimal, and then show that a subgradient certifying optimality of the desired solution exists under the
given assumptions. In our setting, the outliers, $C_0$, preclude exact recovery of $L_0$. In fact, the optimum
$\hat{L}$ of (2) will be non-zero in every column of $C_0$ that is not orthogonal to $L_0$'s column space; that is,
Outlier Pursuit (2) cannot recover $L_0$ on the columns corresponding to the outliers (intuitively, no method
can: there is nothing left to recover once the entire point is corrupted, and our choice of setting the
corresponding columns of $L_0$ to zero is arbitrary). Thus a dual certificate certifying optimality of $(L_0, C_0)$
will not exist, in general. However, all we require for success is to recover a pair $(\hat{L}, \hat{C})$ where $\hat{L}$ has the
correct column space and $\hat{C}$ the correct column support. And thus, rather than construct a dual certificate
for optimality of $(L_0, C_0)$, all we need is a dual certificate for any pair $(\hat{L}, \hat{C})$ as above. The challenge
is that we do not know, a priori, what that pair will be, and hence cannot follow the standard road map
to write optimality conditions for a specific pair.

The main new ingredient of the proof of correctness and the analysis of the algorithm, is the introduction 
of an oracle problem with additional side constraints, that produces a solution with the correct column 
space and support. Thus, we have the following: 

Roadmap of the Proof 

1) We define an oracle problem, with additional side constraints that enforce the right column space 
and support. 

2) We then write down the properties a dual certificate must satisfy to certify optimality of the solution 
to the oracle problem. 

3) We construct a dual certificate, thereby obtaining conditions for the range of $\lambda$ for which recovery
is guaranteed. 

Before going into technical details, we list some technical preliminaries that we use multiple times in 
the sequel. The following lemma is well-known, and gives the subgradient of the norms we consider. 

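Lemma 1: For any column space $U$, row space $V$ and column support $\mathcal{I}$:

1) Let the SVD of a matrix $A$ be $U \Sigma V^\top$. Then the subgradient to $\|\cdot\|_*$ at $A$ is $\{U V^\top + W \mid \mathcal{P}_T(W) = 0, \ \|W\| \le 1\}$.

2) Let the column support of a matrix $A$ be $\mathcal{I}$. Then the subgradient to $\|\cdot\|_{1,2}$ at $A$ is $\{H + Z \mid \mathcal{P}_{\mathcal{I}}(H) = H, \ H_i = A_i / \|A_i\|_2; \ \mathcal{P}_{\mathcal{I}}(Z) = 0, \ \|Z\|_{\infty,2} \le 1\}$.

3) For any $A$, $B$, we have $\mathcal{P}_{\mathcal{I}}(AB) = A \mathcal{P}_{\mathcal{I}}(B)$; for any $A$, $\mathcal{P}_U \mathcal{P}_{\mathcal{I}}(A) = \mathcal{P}_{\mathcal{I}} \mathcal{P}_U(A)$.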

Lemma 2: If a matrix $H$ satisfies $\|H\|_{\infty,2} \le 1$ and is supported on $\mathcal{I}$, then $\|H\| \le \sqrt{|\mathcal{I}|}$.

Proof: Using the variational form of the operator norm, we have

$$\|H\| = \max_{\|x\|_2 \le 1, \, \|y\|_2 \le 1} x^\top H y = \max_{\|x\|_2 \le 1} \|x^\top H\|_2 = \max_{\|x\|_2 \le 1} \sqrt{\sum_{i \in \mathcal{I}} (x^\top H_i)^2} \le \sqrt{\sum_{i \in \mathcal{I}} 1} = \sqrt{|\mathcal{I}|}.$$

The inequality holds because $\|H_i\|_2 \le 1$ when $i \in \mathcal{I}$, and $H_i$ equals zero otherwise. ■
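
Lemma 2 is simple to check numerically. The sketch below, assuming NumPy, compares the spectral norm of a matrix with unit-norm columns on a small support against the square root of the support size; the helper name `check_lemma2` and the test matrices are illustrative.

```python
import numpy as np

def check_lemma2(H, tol=1e-12):
    """Return (spectral norm of H, sqrt of the number of non-zero columns)."""
    col_norms = np.linalg.norm(H, axis=0)
    support_size = int(np.sum(col_norms > tol))
    return np.linalg.norm(H, 2), np.sqrt(support_size)

# Tight case: two identical unit columns, so ||H|| = sqrt(2) = sqrt(|I|).
H_tight = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0]])
spec_tight, bound_tight = check_lemma2(H_tight)

# Generic case: four random unit-norm columns in a 6 x 10 matrix.
rng = np.random.default_rng(0)
cols = rng.standard_normal((6, 4))
H_rand = np.zeros((6, 10))
H_rand[:, [1, 3, 4, 8]] = cols / np.linalg.norm(cols, axis=0)
spec_rand, bound_rand = check_lemma2(H_rand)
```

The first example shows that the bound cannot be improved in general: it is attained when all supported columns coincide.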



Lemma 3: Given a matrix $U \in \mathbb{R}^{p \times r}$ with orthonormal columns, and any matrix $V \in \mathbb{R}^{n \times r}$, we have
that $\|U V^\top\|_{\infty,2} = \max_i \|V^\top e_i\|_2$.

Proof: By definition we have

$$\|U V^\top\|_{\infty,2} = \max_i \|U (V^\top)_i\|_2 \overset{(a)}{=} \max_i \|(V^\top)_i\|_2 = \max_i \|V^\top e_i\|_2.$$

Here (a) holds since $U$ has orthonormal columns. ■
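
A short numerical check of Lemma 3, assuming NumPy: an orthonormal `U` is obtained from a QR factorization of a random matrix, `V` is arbitrary, and the dimensions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, r = 6, 9, 3

U, _ = np.linalg.qr(rng.standard_normal((p, r)))   # p x r, orthonormal columns
V = rng.standard_normal((n, r))                    # arbitrary n x r matrix

lhs = np.linalg.norm(U @ V.T, axis=0).max()        # ||U V^T||_{inf,2}
rhs = np.linalg.norm(V, axis=1).max()              # max_i ||V^T e_i||_2 (row norms of V)
```

Multiplying by `U` preserves the length of each column of `V.T`, so the two quantities agree up to rounding.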

A. Oracle Problem and Optimality Conditions 

As discussed, in general Outlier Pursuit will not recover the true solution $(L_0, C_0)$, and hence it is not
possible to construct a subgradient certifying optimality of $(L_0, C_0)$. Instead, our goal is to recover any
pair $(\hat{L}, \hat{C})$ so that $\hat{L}$ has the correct column space, and $\hat{C}$ the correct column support. Thus we need only
construct a dual certificate for some such pair. We develop our candidate solution $(\hat{L}, \hat{C})$ by imposing
precisely these constraints on the original optimization problem (2): the solution $\hat{L}$ should have the correct
column space, and $\hat{C}$ should have the correct column support.

Let the SVD of the true $L_0$ be $L_0 = U_0 \Sigma_0 V_0^\top$, and recall that the projection of any matrix $X$ onto
the space of all matrices with column space contained in $U_0$ is given by $\mathcal{P}_{U_0}(X) := U_0 U_0^\top X$. Similarly
for the column support $\mathcal{I}_0$ of the true $C_0$, the projection $\mathcal{P}_{\mathcal{I}_0}(X)$ is the matrix that results when all the
columns in $\mathcal{I}_0^c$ are set to 0.

Note that $U_0$ and $\mathcal{I}_0$ above correspond to the truth. Thus, with this notation, we would like the optimum
of (2) to satisfy $\mathcal{P}_{U_0}(\hat{L}) = \hat{L}$, as this is nothing but the fact that $\hat{L}$ has recovered the true subspace.
Similarly, having $\hat{C}$ satisfy $\mathcal{P}_{\mathcal{I}_0}(\hat{C}) = \hat{C}$ means that we have succeeded in identifying the outliers. The
oracle problem arises by imposing these as additional constraints in (2):

$$\text{Oracle Problem:} \qquad \text{Minimize: } \|L\|_* + \lambda \|C\|_{1,2} \qquad \text{Subject to: } M = L + C; \ \mathcal{P}_{U_0}(L) = L; \ \mathcal{P}_{\mathcal{I}_0}(C) = C. \qquad (7)$$

The problem is of course bounded (by zero), and is feasible, as $(L_0, C_0)$ is a feasible solution. Thus, an
optimal solution, denoted as $(\hat{L}, \hat{C})$, exists. We now show that the solution $(\hat{L}, \hat{C})$ to the oracle problem is
also an optimal solution to Outlier Pursuit. Unlike the original pair $(L_0, C_0)$, we can certify the optimality
of $(\hat{L}, \hat{C})$ by constructing the appropriate subgradient witness.

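The next lemma and definition are key to the development of our optimality conditions.

Lemma 4: Let the pair $(L', C')$ satisfy $L' + C' = M$, $\mathcal{P}_{U_0}(L') = L'$, and $\mathcal{P}_{\mathcal{I}_0}(C') = C'$. Denote the
SVD of $L'$ as $L' = U' \Sigma' V'^\top$, and the column support of $C'$ as $\mathcal{I}'$. Then $U' U'^\top = U_0 U_0^\top$, and $\mathcal{I}' \subseteq \mathcal{I}_0$.

Proof: The only thing we need to prove is that $L'$ has a rank no smaller than that of $U_0$. However, since
$\mathcal{P}_{\mathcal{I}_0}(C') = C'$, we must have $\mathcal{P}_{\mathcal{I}_0^c}(L') = \mathcal{P}_{\mathcal{I}_0^c}(M)$, and thus the rank of $L'$ is at least as large as that of $\mathcal{P}_{\mathcal{I}_0^c}(M)$,
hence $L'$ has a rank no smaller than that of $U_0$. ■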

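Next we define two operators that are closely related to the subgradient of $\|L'\|_*$ and $\|C'\|_{1,2}$.

Definition 1: Let $(L', C')$ satisfy $L' + C' = M$, $\mathcal{P}_{U_0}(L') = L'$, and $\mathcal{P}_{\mathcal{I}_0}(C') = C'$. We define the
following:

$$\mathfrak{N}(L') := U' V'^\top;$$

$$\mathfrak{G}(C') := \left\{ H \in \mathbb{R}^{p \times n} \;\middle|\; \mathcal{P}_{\mathcal{I}_0^c}(H) = 0; \ \forall i \in \mathcal{I}': H_i = \frac{C'_i}{\|C'_i\|_2}; \ \forall i \in \mathcal{I}_0 \cap (\mathcal{I}')^c: \|H_i\|_2 \le 1 \right\},$$

where the SVD of $L'$ is $L' = U' \Sigma' V'^\top$, and the column support of $C'$ is $\mathcal{I}'$. Further define the operator
$\mathcal{P}_{T(L')}(\cdot): \mathbb{R}^{p \times n} \to \mathbb{R}^{p \times n}$ as

$$\mathcal{P}_{T(L')}(X) = \mathcal{P}_{U'}(X) + \mathcal{P}_{V'}(X) - \mathcal{P}_{U'} \mathcal{P}_{V'}(X).$$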

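Now we present and prove the optimality condition (to Outlier Pursuit) for solutions $(L, C)$ that have
the correct column space and support for $L$ and $C$, respectively.

Theorem 3: Let $(L', C')$ satisfy $L' + C' = M$, $\mathcal{P}_{U_0}(L') = L'$, and $\mathcal{P}_{\mathcal{I}_0}(C') = C'$. Then $(L', C')$ is an
optimal solution of Outlier Pursuit if there exists a matrix $Q \in \mathbb{R}^{p \times n}$ that satisfies

$$\begin{aligned}
&\text{(a)} \quad \mathcal{P}_{T(L')}(Q) = \mathfrak{N}(L'); \\
&\text{(b)} \quad \|\mathcal{P}_{T(L')^\perp}(Q)\| \le 1; \\
&\text{(c)} \quad \mathcal{P}_{\mathcal{I}_0}(Q)/\lambda \in \mathfrak{G}(C'); \\
&\text{(d)} \quad \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda.
\end{aligned} \qquad (8)$$

If both inequalities are strict (dubbed $Q$ strictly satisfies (8)), and $\mathcal{P}_{\mathcal{I}_0} \cap \mathcal{P}_{V'} = \{0\}$, then any optimal
solution will have the right column space, and column support.

Proof: By standard convexity arguments [27], a feasible pair $(L', C')$ is an optimal solution of Outlier
Pursuit, if there exists a $Q'$ such that

$$Q' \in \partial \|L'\|_*; \qquad Q' \in \lambda \, \partial \|C'\|_{1,2}.$$

Note that (a) and (b) imply that $Q \in \partial \|L'\|_*$. Furthermore, letting $\mathcal{I}'$ be the support of $C'$, then by Lemma
4, $\mathcal{I}' \subseteq \mathcal{I}_0$. Therefore (c) and (d) imply that

$$Q_i = \frac{\lambda C'_i}{\|C'_i\|_2}, \quad \forall i \in \mathcal{I}',$$

and

$$\|Q_i\|_2 \le \lambda, \quad \forall i \notin \mathcal{I}',$$

which implies that $Q \in \lambda \, \partial \|C'\|_{1,2}$. Thus, $(L', C')$ is an optimal solution.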

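The rest of the proof establishes that when (b) and (d) are strict, then any optimal solution $(L'', C'')$
satisfies $\mathcal{P}_{U_0}(L'') = L''$, and $\mathcal{P}_{\mathcal{I}_0}(C'') = C''$. We show that for any fixed $\Delta \neq 0$, $(L' + \Delta, C' - \Delta)$ is strictly
worse than $(L', C')$, unless $\Delta \in \mathcal{P}_{U_0} \cap \mathcal{P}_{\mathcal{I}_0}$. Let $W$ be such that $\|W\| = 1$, $\langle W, \mathcal{P}_{T(L')^\perp}(\Delta) \rangle = \|\mathcal{P}_{T(L')^\perp}(\Delta)\|_*$,
and $\mathcal{P}_{T(L')}(W) = 0$. Let $F$ be such that

$$F_i = \begin{cases} -\dfrac{\Delta_i}{\|\Delta_i\|_2} & \text{if } i \notin \mathcal{I}_0 \text{ and } \Delta_i \neq 0, \\ 0 & \text{otherwise.} \end{cases}$$

Then $\mathcal{P}_{T(L')}(Q) + W$ is a subgradient of $\|L'\|_*$ and $\mathcal{P}_{\mathcal{I}_0}(Q)/\lambda + F$ is a subgradient of $\|C'\|_{1,2}$. Then we
have

$$\begin{aligned}
&\|L' + \Delta\|_* + \lambda \|C' - \Delta\|_{1,2} \\
&\quad \ge \|L'\|_* + \lambda \|C'\|_{1,2} + \langle \mathcal{P}_{T(L')}(Q) + W, \Delta \rangle - \lambda \langle \mathcal{P}_{\mathcal{I}_0}(Q)/\lambda + F, \Delta \rangle \\
&\quad = \|L'\|_* + \lambda \|C'\|_{1,2} + \|\mathcal{P}_{T(L')^\perp}(\Delta)\|_* + \lambda \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle \mathcal{P}_{T(L')}(Q) - \mathcal{P}_{\mathcal{I}_0}(Q), \Delta \rangle \\
&\quad = \|L'\|_* + \lambda \|C'\|_{1,2} + \|\mathcal{P}_{T(L')^\perp}(\Delta)\|_* + \lambda \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle Q - \mathcal{P}_{T(L')^\perp}(Q) - (Q - \mathcal{P}_{\mathcal{I}_0^c}(Q)), \Delta \rangle \\
&\quad = \|L'\|_* + \lambda \|C'\|_{1,2} + \|\mathcal{P}_{T(L')^\perp}(\Delta)\|_* + \lambda \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle -\mathcal{P}_{T(L')^\perp}(Q), \Delta \rangle + \langle \mathcal{P}_{\mathcal{I}_0^c}(Q), \Delta \rangle \\
&\quad \ge \|L'\|_* + \lambda \|C'\|_{1,2} + (1 - \|\mathcal{P}_{T(L')^\perp}(Q)\|) \, \|\mathcal{P}_{T(L')^\perp}(\Delta)\|_* + (\lambda - \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2}) \, \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} \\
&\quad \ge \|L'\|_* + \lambda \|C'\|_{1,2},
\end{aligned}$$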
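where the last inequality is strict unless

$$\|\mathcal{P}_{T(L')^\perp}(\Delta)\|_* = \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} = 0. \qquad (9)$$

Note that (9) implies that $\mathcal{P}_{T(L')}(\Delta) = \Delta$ and $\mathcal{P}_{\mathcal{I}_0}(\Delta) = \Delta$. Furthermore

$$\mathcal{P}_{\mathcal{I}_0}(\Delta) = \Delta = \mathcal{P}_{T(L')}(\Delta) = \mathcal{P}_{U'}(\Delta) + \mathcal{P}_{V'} \mathcal{P}_{U'^\perp}(\Delta) = \mathcal{P}_{\mathcal{I}_0} \mathcal{P}_{U'}(\Delta) + \mathcal{P}_{V'} \mathcal{P}_{U'^\perp}(\Delta),$$

where the last equality holds because we can write $\mathcal{P}_{\mathcal{I}_0}(\Delta) = \Delta$. This leads to

$$\mathcal{P}_{\mathcal{I}_0} \mathcal{P}_{U'^\perp}(\Delta) = \mathcal{P}_{V'} \mathcal{P}_{U'^\perp}(\Delta).$$

Lemma 4 implies $\mathcal{P}_{U'} = \mathcal{P}_{U_0}$, which means $\mathcal{P}_{U_0^\perp}(\Delta) \in \mathcal{P}_{\mathcal{I}_0} \cap \mathcal{P}_{V'}$, and hence equals 0. Thus, $\Delta \in \mathcal{P}_{U_0}$.
Recall that Equation (9) implies $\Delta \in \mathcal{P}_{\mathcal{I}_0}$; we then have $\Delta \in \mathcal{P}_{\mathcal{I}_0} \cap \mathcal{P}_{U_0}$, which completes the proof. ■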

Thus, the oracle problem determines a solution pair, $(\hat{L}, \hat{C})$, and then using this, Theorem 3 above
gives the conditions a dual certificate must satisfy. The rest of the proof seeks to build a dual certificate
for the pair $(\hat{L}, \hat{C})$. To this end, the following two results are quite helpful in what follows. For the
remainder of the paper, we use $(\hat{L}, \hat{C})$ to denote the pair that is the output of the oracle problem,
and we assume that the SVD of $\hat{L}$ is given as $\hat{L} = \hat{U} \hat{\Sigma} \hat{V}^\top$.

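Lemma 5: There exists a matrix $\bar{V} \in \mathbb{R}^{n \times r}$ with orthonormal columns such that

$$\hat{U} \hat{V}^\top = U_0 \bar{V}^\top.$$

In addition,

$$\mathcal{P}_{\hat{T}}(\cdot) := \mathcal{P}_{\hat{U}}(\cdot) + \mathcal{P}_{\hat{V}}(\cdot) - \mathcal{P}_{\hat{U}} \mathcal{P}_{\hat{V}}(\cdot) = \mathcal{P}_{U_0}(\cdot) + \mathcal{P}_{\bar{V}}(\cdot) - \mathcal{P}_{U_0} \mathcal{P}_{\bar{V}}(\cdot).$$

Proof: Due to Lemma 4, we have $U_0 U_0^\top = \hat{U} \hat{U}^\top$, hence $U_0 = \hat{U} \hat{U}^\top U_0$. Letting $\bar{V} = \hat{V} \hat{U}^\top U_0$, we
have $\hat{U} \hat{V}^\top = U_0 \bar{V}^\top$, and $\bar{V} \bar{V}^\top = \hat{V} \hat{V}^\top$. Note that $U_0 U_0^\top = \hat{U} \hat{U}^\top$ leads to $\mathcal{P}_{U_0} = \mathcal{P}_{\hat{U}}$, and $\bar{V} \bar{V}^\top = \hat{V} \hat{V}^\top$
leads to $\mathcal{P}_{\bar{V}} = \mathcal{P}_{\hat{V}}$, so the second claim follows. ■

Since $(\hat{L}, \hat{C})$ is an optimal solution to Oracle Problem (7), there exist $Q_1$, $Q_2$, $A'$ and $B'$ such that

$$Q_1 + \mathcal{P}_{U_0^\perp}(A') = Q_2 + \mathcal{P}_{\mathcal{I}_0^c}(B'),$$

where $Q_1$, $Q_2$ are subgradients to $\|\hat{L}\|_*$ and to $\lambda \|\hat{C}\|_{1,2}$, respectively. This means that $Q_1 = U_0 \bar{V}^\top + W$
for some orthonormal $\bar{V}$ and $W$ such that $\mathcal{P}_{\hat{T}}(W) = 0$, and $Q_2 = \lambda (\hat{H} + Z)$ for some $\hat{H} \in \mathfrak{G}(\hat{C})$, and
$Z$ such that $\mathcal{P}_{\mathcal{I}_0}(Z) = 0$. Letting $A = W + A'$, $B = \lambda Z + B'$, we have

$$U_0 \bar{V}^\top + \mathcal{P}_{U_0^\perp}(A) = \lambda \hat{H} + \mathcal{P}_{\mathcal{I}_0^c}(B). \qquad (10)$$

Recall that $\hat{H} \in \mathfrak{G}(\hat{C})$ means $\mathcal{P}_{\mathcal{I}_0}(\hat{H}) = \hat{H}$ and $\|\hat{H}\|_{\infty,2} \le 1$.

Lemma 6: We have

$$U_0 \mathcal{P}_{\mathcal{I}_0}(\bar{V}^\top) = \lambda \mathcal{P}_{U_0}(\hat{H}).$$

Proof: We have

$$\begin{aligned}
\mathcal{P}_{U_0} \mathcal{P}_{\mathcal{I}_0}(U_0 \bar{V}^\top + \mathcal{P}_{U_0^\perp}(A)) &= \mathcal{P}_{U_0} \mathcal{P}_{\mathcal{I}_0}(U_0 \bar{V}^\top) + \mathcal{P}_{U_0} \mathcal{P}_{\mathcal{I}_0}(\mathcal{P}_{U_0^\perp}(A)) \\
&= U_0 \mathcal{P}_{\mathcal{I}_0}(\bar{V}^\top) + \mathcal{P}_{U_0} \mathcal{P}_{U_0^\perp} \mathcal{P}_{\mathcal{I}_0}(A) \\
&= U_0 \mathcal{P}_{\mathcal{I}_0}(\bar{V}^\top).
\end{aligned}$$

Furthermore, we have

$$\mathcal{P}_{U_0} \mathcal{P}_{\mathcal{I}_0}(\lambda \hat{H} + \mathcal{P}_{\mathcal{I}_0^c}(B)) = \lambda \mathcal{P}_{U_0}(\hat{H}).$$

The lemma follows from (10). ■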
B. Obtaining Dual Certificates for Outlier Pursuit

In this section, we complete the proof of Theorem 1 by constructing a dual certificate for $(\hat{L}, \hat{C})$,
the solution to the oracle problem, showing it is also the solution to Outlier Pursuit. The conditions the
dual certificate must satisfy are spelled out in Theorem 3. It is helpful to first consider the simpler case
where the corrupted columns are assumed to be orthogonal to the column space of $L_0$ which we seek
to recover. Indeed, in that setting, we have $V_0 = \hat{V} = \bar{V}$, and moreover, straightforward algebra shows
that we automatically satisfy the condition $\mathcal{P}_{\mathcal{I}_0} \cap \mathcal{P}_{V_0} = \{0\}$. (In the general case, however, we require an
additional condition to be satisfied, in order to recover the same property.) Since the columns of $H_0$ are
either zero, or defined as normalizations of the columns of matrix $C_0$ (i.e., normalizations of outliers),
we immediately conclude that $\mathcal{P}_{U_0}(H_0) = \mathcal{P}_{V_0}(H_0) = \mathcal{P}_{T_0}(H_0) = 0$, and also $\mathcal{P}_{\mathcal{I}_0}(U_0 V_0^\top) = 0$. As a result,
it is not hard to verify that the dual certificate for the orthogonal case is:

$$Q_0 = U_0 V_0^\top + \lambda H_0.$$

While not required for the proof of our main results, we include the proof of the orthogonal case in
the Appendix, as there we get a stronger necessary and sufficient condition for recovery.

For the general, non-orthogonal case, however, this certificate does not satisfy the conditions of Theorem 
|3j For instance, Vv (Ho) need no longer be zero, and hence the condition Vt{Qo) — U V T may no longer 
hold. We correct for the effect of the non-orthogonality by modifying Q with matrices Ai and A 2 , which 
we define below. 

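Recalling the definition of $\bar V$ from the earlier lemma, define the matrix $G \in \mathbb{R}^{r \times r}$ as

$$G \triangleq \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big(\mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big)^T.$$

Then we have

$$G = \sum_{i \in \mathcal{I}_0} [(\bar V^T)_i][(\bar V^T)_i]^T \preceq \sum_{i=1}^{n} [(\bar V^T)_i][(\bar V^T)_i]^T = \bar V^T \bar V = I,$$

where $\preceq$ is the generalized inequality induced by the positive semi-definite cone. Hence, $\|G\| \le 1$. The following lemma bounds $\|G\|$ away from 1.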

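Lemma 7: Let $\psi = \|G\|$. Then $\psi \le \lambda^2 \gamma n$. In particular, for $\lambda \le \frac{3}{7\sqrt{\gamma n}}$, we have $\psi < \frac{1}{4}$.

Proof: We have

$$\psi = \big\|U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big(\mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big)^T U_0^T\big\| = \big\|[U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T)][U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T)]^T\big\|,$$

due to the fact that $U_0$ is orthonormal. By Lemma 6, this implies

$$\begin{aligned}
\psi &= \big\|[\lambda \mathcal{P}_{U_0}(\hat H)][\lambda \mathcal{P}_{U_0}(\hat H)]^T\big\| \\
&= \lambda^2 \Big\|\sum_{i \in \mathcal{I}_0} \mathcal{P}_{U_0}(\hat H_i)\,\mathcal{P}_{U_0}(\hat H_i)^T\Big\| \\
&\le \lambda^2 |\mathcal{I}_0| \\
&= \lambda^2 \gamma n.
\end{aligned}$$

The inequality holds because $\|\mathcal{P}_{U_0}(\hat H_i)\|_2 \le 1$ implies $\|\mathcal{P}_{U_0}(\hat H_i)\mathcal{P}_{U_0}(\hat H_i)^T\| \le 1$. ■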
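The relation $0 \preceq G \preceq I$ used above can be checked numerically. The following is a minimal sketch, assuming a randomly drawn orthonormal $\bar V$ and a hypothetical outlier index set; the helper name `outlier_gram` is illustrative and not part of the paper. It only illustrates $\|G\| \le 1$; the sharper bound $\psi \le \lambda^2\gamma n$ relies on Lemma 6 and is not reproduced here.

```python
import numpy as np

def outlier_gram(V_bar, outlier_idx):
    """Return G = P_I0(V_bar^T) P_I0(V_bar^T)^T and psi = ||G|| (spectral norm).

    V_bar is assumed to be n x r with orthonormal columns; outlier_idx is a
    hypothetical index set I_0 of corrupted columns.
    """
    PVt = np.zeros_like(V_bar.T)                   # r x n
    PVt[:, outlier_idx] = V_bar.T[:, outlier_idx]  # keep columns in I_0, zero the rest
    G = PVt @ PVt.T                                # r x r, positive semi-definite
    psi = np.linalg.norm(G, 2)                     # largest singular value
    return G, psi
```

Because $G$ is a partial sum of the rank-one terms that add up to $\bar V^T \bar V = I$, its eigenvalues always lie in $[0, 1]$, whatever index set is chosen.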

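Lemma 8: If $\psi < 1$, then the operation $\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}$ is an injection from $\mathcal{P}_{\bar V}$ to $\mathcal{P}_{\bar V}$, and its inverse operation is $I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i$.

Proof: Fix a matrix $X \in \mathbb{R}^{p \times n}$ such that $\|X\| = 1$. We have that

$$\begin{aligned}
\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}(X) &= \mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(X \bar V \bar V^T) \\
&= \mathcal{P}_{\bar V}\big(X \bar V \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big) \\
&= X \bar V \mathcal{P}_{\mathcal{I}_0}(\bar V^T) \bar V \bar V^T \\
&= X \bar V \big(\mathcal{P}_{\mathcal{I}_0}(\bar V^T) \bar V\big) \bar V^T \\
&= X \bar V G \bar V^T,
\end{aligned}$$

which leads to $\|\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}(X)\| \le \psi$. Since $\psi < 1$, $[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i](X)$ is well defined, and has a spectral norm not larger than $1/(1 - \psi)$.

Note that we have

$$\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V} = \mathcal{P}_{\bar V}(I - \mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}),$$

thus for any $X \in \mathcal{P}_{\bar V}$ the following holds:

$$\begin{aligned}
\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big](X)
&= \mathcal{P}_{\bar V}(I - \mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big](X) \\
&= \mathcal{P}_{\bar V}(X) = X,
\end{aligned}$$

which establishes the lemma. ■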
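The Neumann-series inverse in Lemma 8 can be illustrated on a small random instance. This is only a sketch under stated assumptions: the dimensions, the random orthonormal $\bar V$, the hypothetical index set `I0`, and the truncation of the infinite sum at 100 terms are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, r = 6, 200, 3
I0 = np.arange(5)                      # hypothetical outlier index set
mask = np.zeros(n, dtype=bool)
mask[I0] = True
V, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns (plays V_bar)

def P_V(X):
    """Project the rows of X onto the row space spanned by V^T (right multiplication)."""
    return X @ V @ V.T

def P_I0(X):
    """Keep columns indexed by I0, zero out the others."""
    return X * mask

def P_I0c(X):
    """Zero out columns indexed by I0, keep the others."""
    return X * ~mask

def neumann_inverse(X, terms=100):
    """Apply I + sum_{i>=1} (P_V P_I0 P_V)^i to X, truncated after `terms` terms."""
    out, term = X.copy(), X.copy()
    for _ in range(terms):
        term = P_V(P_I0(P_V(term)))
        out += term
    return out

# psi = ||G|| with G built from the columns of V^T indexed by I0.
psi = np.linalg.norm(V[I0].T @ V[I0], 2)

X = P_V(rng.standard_normal((p, n)))   # an element of the range of P_V
recovered = P_V(P_I0c(P_V(neumann_inverse(X))))
```

When `psi < 1` the truncated series converges geometrically, so `recovered` should agree with `X` up to floating-point error.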

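Now we define the matrices $\Delta_1$ and $\Delta_2$ used to construct the dual certificate. As the proof reveals, they are designed precisely as "corrections" to guarantee that the dual certificate satisfies the required constraints of Theorem 3.

Define $\Delta_1$ and $\Delta_2$ as follows:

$$\Delta_1 \triangleq \lambda \mathcal{P}_{U_0}(\hat H) = U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T); \tag{11}$$

$$\begin{aligned}
\Delta_2 &\triangleq \mathcal{P}_{U_0^\perp}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H) \\
&= \mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}\mathcal{P}_{U_0^\perp}(\lambda \hat H).
\end{aligned} \tag{12}$$

The equality holds since $\mathcal{P}_{\bar V}$, $\mathcal{P}_{\mathcal{I}_0}$ and $\mathcal{P}_{\mathcal{I}_0^c}$ are all given by right matrix multiplication, while $\mathcal{P}_{U_0^\perp}$ is given by left matrix multiplication.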

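Theorem 4: Assume $\psi < 1$. Let

$$\frac{\gamma}{1-\gamma} \le \frac{(1-\psi)^2}{(3-\psi)^2 \mu r},$$

and

$$\frac{(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)} \le \lambda \le \frac{1-\psi}{(2-\psi)\sqrt{n\gamma}},$$

then $Q \triangleq U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2$ satisfies Condition (8) (i.e., it is the dual certificate). If all inequalities hold strictly, then $Q$ strictly satisfies (8).

Proof: Note that $\psi < 1$ implies $\mathcal{P}_{\bar V} \cap \mathcal{P}_{\mathcal{I}_0} = \{0\}$. Hence it suffices to show that $Q$ simultaneously satisfies

(1) $\mathcal{P}_{\hat U}(Q) = \hat U \hat V^T$;

(2) $\mathcal{P}_{\hat V}(Q) = \hat U \hat V^T$;

(3) $\mathcal{P}_{\mathcal{I}_0}(Q) = \lambda \hat H$;

(4) $\|\mathcal{P}_{\hat T^\perp}(Q)\| \le 1$;

(5) $\|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda$.

We prove that each of these five conditions holds in Steps 1-5. Then in Step 6, we show that the condition on $\lambda$ is not vacuous, i.e., the lower bound is no larger than the upper bound (and in fact, we then show that $\lambda = \frac{3}{7\sqrt{\gamma n}}$ is in the specified range).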
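Step 1: We have

$$\begin{aligned}
\mathcal{P}_{\hat U}(Q) = \mathcal{P}_{U_0}(Q) &= \mathcal{P}_{U_0}(U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2) \\
&= U_0 \bar V^T + \lambda \mathcal{P}_{U_0}(\hat H) - \mathcal{P}_{U_0}(\Delta_1) - \mathcal{P}_{U_0}(\Delta_2) \\
&= U_0 \bar V^T \\
&= \hat U \hat V^T.
\end{aligned}$$

Step 2: We have

$$\begin{aligned}
\mathcal{P}_{\hat V}(Q) = \mathcal{P}_{\bar V}(Q) &= \mathcal{P}_{\bar V}(U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2) \\
&= U_0 \bar V^T + \mathcal{P}_{\bar V}(\lambda \hat H) - \mathcal{P}_{\bar V}\big(\lambda \mathcal{P}_{U_0}(\hat H)\big) - \mathcal{P}_{\bar V}\Big(\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}\mathcal{P}_{U_0^\perp}(\lambda \hat H)\Big) \\
&= U_0 \bar V^T + \mathcal{P}_{\bar V}\big(\mathcal{P}_{U_0^\perp}(\lambda \hat H)\big) - \mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}\mathcal{P}_{U_0^\perp}(\lambda \hat H) \\
&\overset{(a)}{=} U_0 \bar V^T + \mathcal{P}_{\bar V}\big(\mathcal{P}_{U_0^\perp}(\lambda \hat H)\big) - \mathcal{P}_{\bar V}\big(\mathcal{P}_{U_0^\perp}(\lambda \hat H)\big) \\
&= U_0 \bar V^T \\
&= \hat U \hat V^T.
\end{aligned}$$

Here, (a) holds since on $\mathcal{P}_{\bar V}$, $[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i]$ is the inverse operation of $\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}$.

Step 3: We have

$$\begin{aligned}
\mathcal{P}_{\mathcal{I}_0}(Q) &= \mathcal{P}_{\mathcal{I}_0}(U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2) \\
&= U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T) + \lambda \hat H - \mathcal{P}_{\mathcal{I}_0}\big(U_0 \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\big) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}\mathcal{P}_{U_0^\perp}(\lambda \hat H) \\
&= \lambda \hat H.
\end{aligned}$$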

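Step 4: We need a lemma first.

Lemma 9: Given $X \in \mathbb{R}^{p \times n}$ such that $\|X\| = 1$, we have $\|\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}(X)\| \le 1$.

Proof: By definition,

$$\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}(X) = X \bar V \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T).$$

For any $z \in \mathbb{R}^n$ such that $\|z\|_2 = 1$, we have

$$\|X \bar V \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) z\|_2 = \|X \bar V \bar V^T \mathcal{P}_{\mathcal{I}_0^c}(z)\|_2 \le \|X\| \, \|\bar V \bar V^T\| \, \|\mathcal{P}_{\mathcal{I}_0^c}(z)\|_2 \le 1,$$

where we use $\mathcal{P}_{\mathcal{I}_0^c}(z)$ to represent the vector whose coordinates $i \in \mathcal{I}_0$ are set to zero. The last inequality follows from the fact that $\|X\| = 1$. Note that this holds for any $z$, hence by the definition of the spectral norm (as the $\ell_2$ operator norm), the lemma follows. ■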
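Now we continue with Step 4. We have

$$\begin{aligned}
\mathcal{P}_{\hat T^\perp}(Q) &= \mathcal{P}_{\hat T^\perp}(U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2) \\
&= \mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}(\lambda \hat H) - \mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}\Big(\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}\mathcal{P}_{U_0^\perp}(\lambda \hat H)\Big) \\
&= \mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}(\lambda \hat H) - \mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H).
\end{aligned}$$

Let $\nu = \|\lambda \hat H\|$. Recall that we have shown $\nu \le \lambda\sqrt{|\mathcal{I}_0|}$. Thus we have $\|\mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}(\lambda \hat H)\| \le \nu$. Furthermore, we have the following:

$$\begin{aligned}
&\|\mathcal{P}_{\bar V}(\lambda \hat H)\| \le \nu \\
\Rightarrow\; &\Big\|\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H)\Big\| \le \nu/(1-\psi) \\
\Rightarrow\; &\Big\|\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H)\Big\| \le \nu/(1-\psi) \\
\Rightarrow\; &\Big\|\mathcal{P}_{\bar V^\perp}\mathcal{P}_{U_0^\perp}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H)\Big\| \le \nu/(1-\psi).
\end{aligned}$$

Here the second implication uses Lemma 9. Thus we have that

$$\|\mathcal{P}_{\hat T^\perp}(Q)\| \le \frac{2-\psi}{1-\psi}\,\nu \le \frac{2-\psi}{1-\psi}\,\lambda\sqrt{\gamma n}.$$

From the assumptions of the theorem, we have

$$\lambda \le \frac{1-\psi}{(2-\psi)\sqrt{n\gamma}},$$

and hence

$$\|\mathcal{P}_{\hat T^\perp}(Q)\| \le 1.$$

The inequality will be strict if

$$\lambda < \frac{1-\psi}{(2-\psi)\sqrt{n\gamma}}.$$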



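Step 5: We first need a lemma that shows that the incoherence parameter for the matrix $\bar V$ is no larger than the incoherence parameter of the original matrix $V_0$.

Lemma 10: Define the incoherence of $\bar V$ as follows:

$$\bar\mu = \max_{i \in \mathcal{I}_0^c} \frac{|\mathcal{I}_0^c|}{r}\,\|\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) e_i\|^2.$$

Then $\bar\mu \le \mu$.

Proof: Recall that $L_0 = U_0 \Sigma_0 V_0^T$, and

$$\mu = \max_{i \in \mathcal{I}_0^c} \frac{|\mathcal{I}_0^c|}{r}\,\|\mathcal{P}_{\mathcal{I}_0^c}(V_0^T) e_i\|^2.$$

Thus it suffices to show that for fixed $i \in \mathcal{I}_0^c$, the following holds:

$$\|\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) e_i\| \le \|\mathcal{P}_{\mathcal{I}_0^c}(V_0^T) e_i\|.$$

Note that $\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)$ and $\mathcal{P}_{\mathcal{I}_0^c}(V_0^T)$ span the same row space. Thus, due to the fact that $\mathcal{P}_{\mathcal{I}_0^c}(V_0^T)$ is orthonormal, we conclude that $\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)$ is row-wise full rank. Since $\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)^T = I - G$, and $G \succeq 0$, there exists a symmetric, invertible matrix $Y \in \mathbb{R}^{r \times r}$ such that

$$\|Y\| \le 1; \quad \text{and} \quad Y^2 = \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)^T.$$

This in turn implies that $Y^{-1}\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)$ is orthonormal and spans the same row space as $\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)$, and hence spans the same row space as $\mathcal{P}_{\mathcal{I}_0^c}(V_0^T)$. Note that $\mathcal{P}_{\mathcal{I}_0^c}(V_0^T)$ is also orthonormal, which implies there exists an orthonormal matrix $Z \in \mathbb{R}^{r \times r}$ such that

$$Z Y^{-1}\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) = \mathcal{P}_{\mathcal{I}_0^c}(V_0^T).$$

We have

$$\|\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) e_i\|_2 = \|Y Z^T \mathcal{P}_{\mathcal{I}_0^c}(V_0^T) e_i\|_2 \le \|Y\| \, \|Z^T\| \, \|\mathcal{P}_{\mathcal{I}_0^c}(V_0^T) e_i\|_2 \le \|\mathcal{P}_{\mathcal{I}_0^c}(V_0^T) e_i\|_2.$$

This concludes the proof of the lemma. ■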
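Now, recall from the proof of Lemma 8 that

$$\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}(X) = X \bar V G \bar V^T.$$

Hence, noting that $(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i = (\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^{i-1}$ and $\bar V^T \bar V = I$, by induction we have

$$(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i(X) = X \bar V G^i \bar V^T.$$

We use this to expand $\Delta_2$:

$$\begin{aligned}
\Delta_2 &= \mathcal{P}_{U_0^\perp}\mathcal{P}_{\mathcal{I}_0^c}\mathcal{P}_{\bar V}\Big[I + \sum_{i=1}^{\infty}(\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V})^i\Big]\mathcal{P}_{\bar V}(\lambda \hat H) \\
&= (I - U_0 U_0^T)(\lambda \hat H)\bar V \bar V^T\Big[I + \sum_{i=1}^{\infty}\bar V G^i \bar V^T\Big]\bar V \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T).
\end{aligned}$$

Thus, we have

$$\begin{aligned}
\|\Delta_2 e_i\|_2 &\le \|I - U_0 U_0^T\| \, \|\lambda \hat H\| \, \|\bar V \bar V^T\| \, \Big\|I + \sum_{i=1}^{\infty}\bar V G^i \bar V^T\Big\| \, \|\bar V\| \, \|\mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) e_i\|_2 \\
&\le \|\lambda \hat H\| \, \frac{1}{1-\psi}\sqrt{\frac{\bar\mu r}{n - |\mathcal{I}_0|}} \\
&\le \frac{\lambda\sqrt{|\mathcal{I}_0|}}{1-\psi}\sqrt{\frac{\mu r}{n - |\mathcal{I}_0|}} = \frac{\lambda\sqrt{\frac{\gamma}{1-\gamma}\mu r}}{1-\psi},
\end{aligned}$$

where we have used Lemma 10 in the last inequality. This now implies

$$\|\Delta_2\|_{\infty,2} \le \frac{\lambda\sqrt{\frac{\gamma}{1-\gamma}\mu r}}{1-\psi}.$$

Notice that

$$\begin{aligned}
\|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} &= \|\mathcal{P}_{\mathcal{I}_0^c}(U_0 \bar V^T + \lambda \hat H - \Delta_1 - \Delta_2)\|_{\infty,2} \\
&= \|U_0 \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T) - \Delta_2\|_{\infty,2} \\
&\le \|U_0 \mathcal{P}_{\mathcal{I}_0^c}(\bar V^T)\|_{\infty,2} + \|\Delta_2\|_{\infty,2} \\
&\le \sqrt{\frac{\mu r}{n - |\mathcal{I}_0|}} + \frac{\lambda\sqrt{\frac{\gamma}{1-\gamma}\mu r}}{1-\psi}.
\end{aligned}$$

Therefore, since $n - |\mathcal{I}_0| = (1-\gamma)n$, the bound $\|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda$ follows once we show

$$\lambda \ge \frac{(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)},$$

as long as $1 - \psi - \sqrt{\frac{\gamma}{1-\gamma}\mu r} > 0$ (which is proved in Step 6).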

Step 6: We have shown that each of the 5 conditions holds. Finally, we show that the theorem's conditions on $\lambda$ can be satisfied. But this amounts to a condition on $\gamma$. Indeed, we have:

$$\frac{(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)} \le \frac{1-\psi}{(2-\psi)\sqrt{n\gamma}}
\quad\Longleftrightarrow\quad
\frac{\gamma}{1-\gamma} \le \frac{(1-\psi)^2}{(3-\psi)^2 \mu r},$$

which can certainly be satisfied, since the right hand side does not depend on $\gamma$. Moreover, observe that under this condition, $1 - \psi - \sqrt{\frac{\gamma}{1-\gamma}\mu r} > 0$ holds. Note that if the last inequality holds strictly, then so does the first. ■
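We have thus shown that as long as $\psi < 1$, then for $\lambda$ within the given bounds, we can construct a dual certificate. From here, the following corollary immediately establishes our main result, Theorem 1.

Corollary 1: Let $\gamma \le \gamma^*$. Then Outlier Pursuit, with $\lambda = \frac{3}{7\sqrt{\gamma^* n}}$, strictly succeeds if

$$\frac{\gamma^*}{1-\gamma^*} \le \frac{9}{121 \mu r}.$$

Proof: First note that $\lambda = \frac{3}{7\sqrt{\gamma^* n}}$ and $\gamma \le \gamma^*$ imply that

$$\lambda \le \frac{3}{7\sqrt{\gamma n}},$$

which by Lemma 7 leads to

$$\psi \le \lambda^2 \gamma n < \frac{1}{4}.$$

Thus, it suffices to check that $\gamma$ and $\lambda$ satisfy the conditions of Theorem 4, namely

$$\frac{\gamma}{1-\gamma} < \frac{(1-\psi)^2}{(3-\psi)^2 \mu r},$$

and

$$\frac{(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)} < \lambda < \frac{1-\psi}{(2-\psi)\sqrt{n\gamma}}.$$

Since $\psi < 1/4$, we have

$$\frac{\gamma}{1-\gamma} \le \frac{\gamma^*}{1-\gamma^*} \le \frac{9}{121\mu r} = \frac{(1-1/4)^2}{(3-1/4)^2\mu r} < \frac{(1-\psi)^2}{(3-\psi)^2\mu r},$$

which proves the first condition.

Next, observe that $\frac{(1-\psi)\sqrt{\mu r/(1-\gamma)}}{\sqrt{n}\,(1-\psi-\sqrt{\gamma\mu r/(1-\gamma)})}$, as a function of $\psi$, $\gamma$ and $(\mu r)$, is strictly increasing in $\psi$, $(\mu r)$ and $\gamma$. Moreover, $\mu r \le \frac{9(1-\gamma^*)}{121\gamma^*}$, and thus

$$\frac{(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)}
< \frac{(1-\frac14)\sqrt{\frac{\mu r}{1-\gamma^*}}}{\sqrt{n}\Big(1-\frac14-\sqrt{\frac{\gamma^*}{1-\gamma^*}\mu r}\Big)}
\le \frac{(1-\frac14)\frac{3}{11\sqrt{\gamma^*}}}{\sqrt{n}\big(1-\frac14-\frac{3}{11}\big)}
= \frac{3}{7\sqrt{\gamma^* n}} = \lambda.$$

Similarly, $\frac{1-\psi}{(2-\psi)\sqrt{n\gamma}}$ is strictly decreasing in $\psi$ and $\gamma$, which implies that

$$\frac{1-\psi}{(2-\psi)\sqrt{n\gamma}} > \frac{1-1/4}{(2-1/4)\sqrt{n\gamma^*}} = \lambda.$$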
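The arithmetic behind Corollary 1 can be sanity-checked with a few lines of code. The sketch below evaluates the two-sided bound on $\lambda$ from Theorem 4; the function name `theorem4_lambda_bounds` and its argument names are illustrative assumptions, not notation from the paper.

```python
import math

def theorem4_lambda_bounds(gamma, mu_r, psi, n):
    """Return (lower, upper) bounds on lambda from Theorem 4.

    gamma: fraction of outlier columns; mu_r: the product mu * r;
    psi: spectral norm of G (must be < 1); n: number of columns.
    Returns None when the denominator of the lower bound is not positive.
    """
    a = math.sqrt(gamma * mu_r / (1.0 - gamma))
    slack = 1.0 - psi - a
    if not (0.0 <= psi < 1.0) or slack <= 0.0:
        return None
    lower = (1.0 - psi) * math.sqrt(mu_r / (1.0 - gamma)) / (math.sqrt(n) * slack)
    upper = (1.0 - psi) / ((2.0 - psi) * math.sqrt(n * gamma))
    return lower, upper
```

At the boundary case $\psi = 1/4$, $\gamma = \gamma^*$ and $\mu r = 9(1-\gamma^*)/(121\gamma^*)$, both bounds collapse to $3/(7\sqrt{\gamma^* n})$; for any smaller $\psi$ the interval opens up and contains that value, which is the content of the corollary.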



V. Proof of Theorem 2: The Case of Noise

In practice, the observed matrix may be a noisy copy of $M$. In this section, we investigate this noisy case and show that the proposed method, with minor modification, is robust to noise. Specifically, we observe $M' = M + N$ for some unknown $N$, and we want to approximately recover $U_0$ and $\mathcal{I}_0$. This leads to the following formulation, which replaces the equality constraint $M = L + C$ with a norm inequality:

$$\begin{aligned}
\text{Minimize:}\quad & \|L\|_* + \lambda\|C\|_{1,2} \\
\text{Subject to:}\quad & \|M' - L - C\|_F \le \epsilon.
\end{aligned} \tag{13}$$

In fact, we show in this section that Noisy Outlier Pursuit succeeds under essentially the same conditions as in the noiseless case. Here, we say that the algorithm "succeeds" if the optimal solution of (13) is "close" to a pair that has the correct column space and column support. To this end, we first establish the next theorem, a counterpart in the noisy case of Theorem 3, which states that Noisy Outlier Pursuit succeeds if there exists a dual certificate (with slightly stronger requirements than in the noiseless case) for decomposing the noiseless matrix $M$. Then, applying our results on constructing the dual certificate from the previous section, we obtain that Noisy Outlier Pursuit succeeds under essentially the same conditions as in the noiseless case.

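Theorem 5: Let $L', C'$ be an optimal solution of (13). Suppose $\|N\|_F \le \epsilon$, $\lambda < 1$, and $\psi < 1/4$. Let $M = \hat L + \hat C$ where $\mathcal{P}_{U_0}(\hat L) = \hat L$ and $\mathcal{P}_{\mathcal{I}_0}(\hat C) = \hat C$. If there exists a $Q$ such that

$$\mathcal{P}_{T(\hat L)}(Q) = \mathfrak{N}(\hat L); \quad \|\mathcal{P}_{T(\hat L)^\perp}(Q)\| \le 1/2; \quad \mathcal{P}_{\mathcal{I}_0}(Q)/\lambda \in \mathfrak{G}(\hat C); \quad \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda/2, \tag{14}$$

then there exists a pair $(\tilde L, \tilde C)$ such that $M = \tilde L + \tilde C$, $\tilde L \in \mathcal{P}_{U_0}$, $\tilde C \in \mathcal{P}_{\mathcal{I}_0}$ and

$$\|L' - \tilde L\|_F \le 10\sqrt{n}\,\epsilon; \quad \|C' - \tilde C\|_F \le 9\sqrt{n}\,\epsilon.$$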
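Proof: Let $\bar V$ be as defined before. We establish the following lemma first.

Lemma 11: Recall that $\psi = \|G\|$ where $G = \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T$. We have

$$\|\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(X)\|_F \le \psi\|X\|_F.$$

Proof: Let $T \in \mathbb{R}^{n \times n}$ be such that

$$T_{ij} = \begin{cases} 1 & \text{if } i = j,\ i \in \mathcal{I}_0; \\ 0 & \text{otherwise.} \end{cases}$$

We then expand $\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(X)$, which equals

$$X T \bar V \bar V^T T = X T \bar V \bar V^T T^T = X (T \bar V)(T \bar V)^T = X \mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T).$$

The last equality follows from $(T \bar V)^T = \mathcal{P}_{\mathcal{I}_0}(\bar V^T)$. Since $\psi = \|G\|$ where $G = \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T$, we have

$$\|\mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\| = \|\mathcal{P}_{\mathcal{I}_0}(\bar V^T)\mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T\| = \psi.$$

Now consider the $i$-th row of $X$, denoted as $x^i$. Since $\|\mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\| = \psi$, we have

$$\|x^i \mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\|_2^2 \le \psi^2 \|x^i\|_2^2.$$

The lemma follows from the following inequality:

$$\|\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(X)\|_F^2 = \|X \mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\|_F^2 = \sum_i \|x^i \mathcal{P}_{\mathcal{I}_0}(\bar V^T)^T \mathcal{P}_{\mathcal{I}_0}(\bar V^T)\|_2^2 \le \psi^2 \sum_i \|x^i\|_2^2 = \psi^2 \|X\|_F^2.$$

■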

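Let $N_L = L' - \hat L$, and $N_C = C' - \hat C$. Thus $N = N_C + N_L$, and recall that $\|N\|_F \le \epsilon$. Further, define $N_L^+ = N_L - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(N_L)$, $N_C^+ = N_C - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(N_C)$, and $N^+ = N - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(N)$. Observe that for any $A$, $\|(I - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0})(A)\|_F \le \|A\|_F$.

Choosing the same $W$ and $F$ as in the proof of Theorem 3, we have

$$\begin{aligned}
\|\hat L\|_* + \lambda\|\hat C\|_{1,2} &\ge \|L'\|_* + \lambda\|C'\|_{1,2} \\
&\ge \|\hat L\|_* + \lambda\|\hat C\|_{1,2} + \langle \mathcal{P}_{T(\hat L)}(Q) + W, N_L\rangle + \lambda\langle \mathcal{P}_{\mathcal{I}_0}(Q)/\lambda + F, N_C\rangle \\
&= \|\hat L\|_* + \lambda\|\hat C\|_{1,2} + \|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_* + \lambda\|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_{1,2} + \langle \mathcal{P}_{T(\hat L)}(Q), N_L\rangle + \langle \mathcal{P}_{\mathcal{I}_0}(Q), N_C\rangle \\
&= \|\hat L\|_* + \lambda\|\hat C\|_{1,2} + \|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_* + \lambda\|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_{1,2} - \langle \mathcal{P}_{T(\hat L)^\perp}(Q), N_L\rangle - \langle \mathcal{P}_{\mathcal{I}_0^c}(Q), N_C\rangle + \langle Q, N_L + N_C\rangle \\
&\ge \|\hat L\|_* + \lambda\|\hat C\|_{1,2} + (1 - \|\mathcal{P}_{T(\hat L)^\perp}(Q)\|)\|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_* + (\lambda - \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2})\|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_{1,2} + \langle Q, N\rangle \\
&\ge \|\hat L\|_* + \lambda\|\hat C\|_{1,2} + (1/2)\|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_* + (\lambda/2)\|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_{1,2} - \epsilon\|Q\|_F.
\end{aligned}$$

Note that $\|Q\|_{\infty,2} \le \lambda$, hence $\|Q\|_F \le \sqrt{n}\,\lambda$. Thus we have

$$\begin{aligned}
\|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_F &\le \|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_* \le 2\lambda\sqrt{n}\,\epsilon; \\
\|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_F &\le \|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_{1,2} \le 2\sqrt{n}\,\epsilon.
\end{aligned} \tag{15}$$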

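Furthermore,

$$\begin{aligned}
\mathcal{P}_{\mathcal{I}_0}(N_C^+) &= \mathcal{P}_{\mathcal{I}_0}(N_C) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C) \\
&= \mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C) \\
&= \mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N_C) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C) \\
&= \mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0^c}(N_C) \\
&\qquad + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}(N_C) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C) \\
&\overset{(a)}{=} \mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0^c}(N_C) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}(N_C^+) \\
&\overset{(b)}{=} \mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0^c}(N_C) + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(N_C^+).
\end{aligned} \tag{16}$$

Here (a) holds due to the following:

$$\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}(N_C^+) = \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}(N_C) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}\big(\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(N_C)\big) = \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0}(N_C) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C),$$

and (b) holds since by definition, each column of $N_C^+$ indexed by $\mathcal{I}_0$ is orthogonal to $U_0$, hence $\mathcal{P}_{U_0}\mathcal{P}_{\mathcal{I}_0}(N_C^+) = 0$. Thus, Equation (16) leads to

$$\begin{aligned}
\|\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F
&\le \|\mathcal{P}_{\mathcal{I}_0}(N) - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}(N)\|_F + \|\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_F + \|\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{T(\hat L)}\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_F + \|\mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{\bar V}\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F \\
&\le \|N\|_F + \|\mathcal{P}_{T(\hat L)^\perp}(N_L)\|_F + \|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_F + \psi\|\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F \\
&\le (1 + 2\lambda\sqrt{n} + 2\sqrt{n})\epsilon + \psi\|\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F.
\end{aligned}$$

This implies that

$$\|\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F \le (1 + 2\lambda\sqrt{n} + 2\sqrt{n})\epsilon/(1 - \psi).$$

Now using the fact that $\lambda < 1$ and $\psi < 1/4$, we have

$$\|N_C^+\|_F = \|\mathcal{P}_{\mathcal{I}_0^c}(N_C) + \mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F \le \|\mathcal{P}_{\mathcal{I}_0^c}(N_C)\|_F + \|\mathcal{P}_{\mathcal{I}_0}(N_C^+)\|_F \le 9\sqrt{n}\,\epsilon.$$

Note that $N_C^+ = (I - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0})(C' - \hat C) = C' - [\hat C + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(C' - \hat C)]$. Letting $\tilde C = \hat C + \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(C' - \hat C)$, we have $\tilde C \in \mathcal{P}_{\mathcal{I}_0}$ and $\|C' - \tilde C\|_F \le 9\sqrt{n}\,\epsilon$. Letting $\tilde L = \hat L - \mathcal{P}_{\mathcal{I}_0}\mathcal{P}_{U_0}(C' - \hat C)$, we have that $(\tilde L, \tilde C)$ is a successful decomposition, and

$$\|L' - \tilde L\|_F \le \|N\|_F + \|C' - \tilde C\|_F \le 10\sqrt{n}\,\epsilon.$$

■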

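Remark: From the proof of Theorem 4, we have that Condition (14) holds when

$$\frac{\gamma}{1-\gamma} \le \frac{(1-\psi)^2}{(9-4\psi)^2 \mu r}$$

and

$$\frac{2(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\Big(1-\psi-\sqrt{\frac{\gamma}{1-\gamma}\mu r}\Big)} \le \lambda \le \frac{1-\psi}{2(2-\psi)\sqrt{n\gamma}}.$$

For example, one can take

$$\lambda = \frac{\sqrt{9 + 1024\mu r}}{14\sqrt{n}},$$

and all conditions of Theorem 5 hold when

$$\frac{\gamma}{1-\gamma} \le \frac{9}{1024\mu r}.$$

This establishes Theorem 2.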
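As a worked check of the constants in the remark (a sketch that simply repeats the algebra of Step 6 with the halved thresholds of Condition (14); the shorthand $a$ is introduced here for convenience and is not notation from the paper), write $a = \sqrt{\frac{\gamma}{1-\gamma}\mu r}$. The two-sided condition on $\lambda$ is non-vacuous exactly when

```latex
\begin{aligned}
\frac{2(1-\psi)\sqrt{\frac{\mu r}{1-\gamma}}}{\sqrt{n}\,(1-\psi-a)}
  \le \frac{1-\psi}{2(2-\psi)\sqrt{n\gamma}}
&\iff 4(2-\psi)\,a \le 1-\psi-a \\
&\iff (9-4\psi)\,a \le 1-\psi \\
&\iff \frac{\gamma}{1-\gamma} \le \frac{(1-\psi)^2}{(9-4\psi)^2\,\mu r}.
\end{aligned}
% At psi = 1/4 the right-hand side equals (3/4)^2 / (8^2 mu r) = 9 / (1024 mu r).
% With gamma/(1-gamma) = 9/(1024 mu r), i.e. gamma = 9/(9 + 1024 mu r),
% the upper bound at psi = 1/4 becomes
\frac{1-\tfrac14}{2\,(2-\tfrac14)\sqrt{n\gamma}}
  = \frac{3}{14\sqrt{n\gamma}}
  = \frac{\sqrt{9+1024\,\mu r}}{14\sqrt{n}}.
```

This reproduces both the threshold $9/(1024\mu r)$ and the example value of $\lambda$ quoted in the remark.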



VI. Implementation issues and numerical experiments 

While minimizing the nuclear norm is known to be a semi-definite program, and can be solved using a general purpose SDP solver such as SDPT3 or SeDuMi, such a method does not scale well to large data-sets. In fact, the computational time becomes prohibitive even for modest problem sizes as small as hundreds of variables. Recently, a family of optimization algorithms known as proximal gradient algorithms have been proposed to solve optimization problems of the form

minimize: $g(x)$, subject to: $\mathcal{A}(x) = b$,

of which Outlier Pursuit is a special case. It is known that such algorithms converge with a rate of $O(k^{-2})$, where $k$ is the number of iterations, and significantly outperform interior point methods for solving SDPs in practice. Following this paradigm, we solve Outlier Pursuit with the following algorithm. The validity of the algorithm follows easily from [28], [29]. See also [30].

Input: $M \in \mathbb{R}^{m \times n}$, $\lambda$, $\delta := 10^{-5}$, $\eta := 0.9$, $\mu_0 := 0.99\|M\|_F$.

1) $L_{-1}, L_0 := 0^{m \times n}$; $C_{-1}, C_0 := 0^{m \times n}$; $t_{-1}, t_0 := 1$; $\bar\mu := \delta\mu_0$;

2) while not converged do

3) $Y_k^L := L_k + \frac{t_{k-1}-1}{t_k}(L_k - L_{k-1})$, $Y_k^C := C_k + \frac{t_{k-1}-1}{t_k}(C_k - C_{k-1})$;

4) $G_k^L := Y_k^L - \frac{1}{2}(Y_k^L + Y_k^C - M)$; $G_k^C := Y_k^C - \frac{1}{2}(Y_k^L + Y_k^C - M)$;

5) $(U, S, V) := \mathrm{svd}(G_k^L)$; $L_{k+1} := U \mathfrak{L}_{\mu_k/2}(S) V$;

6) $C_{k+1} := \mathfrak{C}_{\lambda\mu_k/2}(G_k^C)$;

7) $t_{k+1} := \frac{1 + \sqrt{4t_k^2 + 1}}{2}$; $\mu_{k+1} := \max(\eta\mu_k, \bar\mu)$; $k{+}{+}$;

8) end while

Output: $L := L_k$, $C := C_k$.

Here, $\mathfrak{L}_\epsilon(S)$ is the diagonal soft-thresholding operator: if $|S_{ii}| \le \epsilon$, then it is set to zero; otherwise, we set $S_{ii} := S_{ii} - \epsilon \cdot \mathrm{sgn}(S_{ii})$. Similarly, $\mathfrak{C}_\epsilon(C)$ is the column-wise thresholding operator: set $C_i$ to zero if $\|C_i\|_2 \le \epsilon$; otherwise set $C_i := C_i - \epsilon C_i/\|C_i\|_2$.
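The listing above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_iter` and `tol` parameters, and the stopping rule (stop once $\mu_k$ has reached its floor and the iterates barely change) are assumptions, since the listing only says "while not converged". The singular value decomposition is taken in the convention `G = U diag(s) Vt`.

```python
import numpy as np

def singular_value_shrink(G, eps):
    """Soft-threshold the singular values of G by eps (operator L_eps)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * np.maximum(s - eps, 0.0)) @ Vt

def column_shrink(G, eps):
    """Shrink every column of G toward zero by eps in l2 norm (operator C_eps)."""
    norms = np.linalg.norm(G, axis=0)
    # Columns with norm <= eps get scale 0; the tiny floor avoids division by zero.
    scale = np.maximum(1.0 - eps / np.maximum(norms, 1e-300), 0.0)
    return G * scale

def outlier_pursuit(M, lam, delta=1e-5, eta=0.9, max_iter=500, tol=1e-7):
    """Accelerated proximal gradient sketch for min ||L||_* + lam * ||C||_{1,2}."""
    M = np.asarray(M, dtype=float)
    mu = 0.99 * np.linalg.norm(M, "fro")
    mu_bar = delta * mu
    L_prev = L = np.zeros_like(M)
    C_prev = C = np.zeros_like(M)
    t_prev = t = 1.0
    for _ in range(max_iter):
        beta = (t_prev - 1.0) / t
        YL = L + beta * (L - L_prev)
        YC = C + beta * (C - C_prev)
        half_residual = 0.5 * (YL + YC - M)          # gradient step with step size 1/2
        L_next = singular_value_shrink(YL - half_residual, mu / 2.0)
        C_next = column_shrink(YC - half_residual, lam * mu / 2.0)
        change = np.linalg.norm(L_next - L) + np.linalg.norm(C_next - C)
        L_prev, L = L, L_next
        C_prev, C = C, C_next
        t_prev, t = t, (1.0 + np.sqrt(4.0 * t * t + 1.0)) / 2.0
        at_floor = mu <= mu_bar                       # continuation has finished
        mu = max(eta * mu, mu_bar)
        if at_floor and change <= tol * max(1.0, np.linalg.norm(M)):
            break
    return L, C
```

A call such as `L, C = outlier_pursuit(M, lam)` returns the two components; columns of `C` with large $\ell_2$ norm are the candidate outliers. Because the two gradient-step matrices always sum to $M$, the residual $\|M - L - C\|_F$ after each iteration is controlled by the current threshold $\mu_k$, which is why the sketch waits for $\mu_k$ to reach $\bar\mu$ before testing convergence.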

We explore the performance of Outlier Pursuit on some synthetic and real-world data, and find that its performance is quite promising.¹ Our first experiment investigates the phase-transition property of Outlier Pursuit, using randomly generated synthetic data. Fix $n = p = 400$. For different $r$ and number of outliers $\gamma n$, we generated matrices $A \in \mathbb{R}^{p \times r}$ and $B \in \mathbb{R}^{(n - \gamma n) \times r}$ where each entry is an independent $\mathcal{N}(0, 1)$ random variable, and then set $L^* := A \times B^T$ (the "clean" part of $M$). Outliers, $C^* \in \mathbb{R}^{\gamma n \times p}$, are generated either neutrally, where each entry of $C^*$ is iid $\mathcal{N}(0, 1)$, or adversarially, where every column is an identical copy of a random Gaussian vector. Outlier Pursuit succeeds if $\hat C \in \mathcal{P}_{\mathcal{I}}$ and $\hat L \in \mathcal{P}_{U}$.

Fig. 1. This figure shows the performance of our algorithm in the case of complete observation (compare the next figure). The results shown represent an average over 10 trials.
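The synthetic setup described above can be sketched as follows. This is an illustrative generator only: the function name, the seed handling, and the choice to place the outliers in the last columns of $M$ are assumptions made here, not details given in the text.

```python
import numpy as np

def make_synthetic(p=400, n=400, r=5, n_out=20, adversarial=False, seed=0):
    """Build M = [clean low-rank columns | outlier columns].

    The clean part is A @ B.T with iid N(0, 1) entries in A and B. Outliers are
    either iid N(0, 1) ("neutral") or identical copies of one Gaussian vector
    ("adversarial"). Placing outliers last is an arbitrary convention.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((p, r))
    B = rng.standard_normal((n - n_out, r))
    L_clean = A @ B.T                                   # p x (n - n_out), rank r
    if adversarial:
        C_out = np.tile(rng.standard_normal((p, 1)), (1, n_out))
    else:
        C_out = rng.standard_normal((p, n_out))
    M = np.hstack([L_clean, C_out])
    outlier_idx = np.arange(n - n_out, n)
    return M, L_clean, outlier_idx
```

Sweeping `r` and `n_out` over a grid and averaging a success indicator over repeated seeds would give a phase-transition plot of the kind shown in Fig. 1.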



Figure 1 shows the phase transition property. We represent success in gray scale, with white denoting success, and black failure. When outliers are random (the easier case), Outlier Pursuit succeeds even when $r = 20$ with 100 outliers. In the adversarial case, Outlier Pursuit succeeds when $r \times \gamma \le c$, and fails otherwise, consistent with our theory's predictions. We then fix $r = \gamma n = 5$ and examine the outlier identification ability of Outlier Pursuit with noisy observations. We scale each outlier so that the $\ell_2$ distance of the outlier to the span of true samples equals a pre-determined value $s$. Each true sample is thus corrupted with a Gaussian random vector with an $\ell_2$ magnitude $\sigma$. We perform (noiseless) Outlier Pursuit on this noisy observation matrix, and claim that the algorithm successfully identifies outliers if for the resulting $\hat C$ matrix, $\|\hat C_j\|_2 < \|\hat C_i\|_2$ for all $j \notin \mathcal{I}$ and $i \in \mathcal{I}$, i.e., there exists a threshold value to separate out outliers. Figure 1(c) shows the result: when $\sigma/s \le 0.3$ for the identical outlier case, and $\sigma/s \le 0.7$ for the random outlier case, Outlier Pursuit correctly identifies the outliers.

¹ We have learned that [31] has also performed some numerical experiments minimizing $\|\cdot\|_* + \lambda\|\cdot\|_{1,2}$, and found promising results.
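The identification criterion used in the noisy experiment, namely that some threshold on column norms separates outliers from true samples, can be written as a short check. The function name and the convention of returning `True` when one of the two groups is empty are assumptions made for this sketch.

```python
import numpy as np

def outliers_separable(C_hat, outlier_idx):
    """True if every outlier column of C_hat has larger l2 norm than every other column."""
    norms = np.linalg.norm(C_hat, axis=0)
    is_outlier = np.zeros(C_hat.shape[1], dtype=bool)
    is_outlier[outlier_idx] = True
    if not is_outlier.any() or is_outlier.all():
        return True                      # nothing to separate
    return bool(norms[~is_outlier].max() < norms[is_outlier].min())
```

Any value strictly between the largest non-outlier norm and the smallest outlier norm then serves as the separating threshold.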

We further study the case of decomposing $M$ under incomplete observation, which is motivated by robust collaborative filtering: we generate $M$ as before, but only observe each entry with a given probability (independently). Letting $\Omega$ be the set of observed entries, we solve

$$\text{Minimize:}\quad \|L\|_* + \lambda\|C\|_{1,2}; \qquad \text{Subject to:}\quad \mathcal{P}_\Omega(L + C) = \mathcal{P}_\Omega(M). \tag{17}$$

The same success condition is used. Figure 2 shows a very promising result: the successful decomposition rate under incomplete observation is close to that of the complete observation case, even when only 30% of entries are observed. Given this empirical result, a natural direction of future research is to understand the theoretical guarantees of (17) in the incomplete observation case.

[Figure 2 panels: (a) 30% entries observed; (b) 80% entries observed; (c) success rate vs observation ratio. Horizontal axes: Outlier Number (a, b); Fraction of Observed Entries (c).]

Fig. 2. This figure shows the case of partial observation, where only a fraction of the entries, sampled uniformly at random, are observed.

Next we report some experimental results on the USPS digit data-set. The goal of this experiment is to show that Outlier Pursuit can be used to identify anomalies within the dataset. We use the data from [32], and construct the observation matrix $M$ as containing the first 220 samples of digit "1" and the last 11 samples of "7". The learning objective is to correctly identify all the "7's". Note that throughout the experiment, label information is unavailable to the algorithm, i.e., there is no training stage. Since the columns of digit "1" are not exactly low rank, an exact decomposition is not possible. Hence, we use the $\ell_2$ norm of each column in the resulting $C$ matrix to identify the outliers: a larger $\ell_2$ norm means that the sample is more likely to be an outlier; essentially, we apply thresholding after $C$ is obtained. Figure 3(a) shows the $\ell_2$ norm of each column of the resulting $C$ matrix. We see that all "7's" are indeed identified. However, two "1" samples (columns 71 and 137) are also identified as outliers, due to the fact that these two samples are written in a way that is different from the rest of the "1's", as shown in Figure 4. Under the same setup, we also simulate the case where only 80% of entries are observed. As Figure 3(b) and (c) show, results similar to those of the complete observation case are obtained, i.e., all true "7's" and also "1's" No 71 and No 177 are identified.

VII. Conclusion and Future Direction 

This paper considers robust PCA from a matrix decomposition approach, and develops the Outlier 
Pursuit algorithm. Under some mild conditions that are quite natural in most PCA settings, we show 
that Outlier Pursuit can exactly recover the column support, and exactly identify outliers. This result is 
new, differing both from results in Robust PCA, and also from results using nuclear-norm approaches for 
matrix completion and matrix reconstruction. One central innovation we introduce is the use of an oracle 
problem. Whenever the recovery concept (in this case, column space) does not uniquely correspond to a 
single matrix (we believe many, if not most cases of interest, fit this description), the use of such a tool will 
be quite useful. Immediate goals for future work include considering specific applications, in particular, robust collaborative filtering (here, the goal is to decompose a partially observed column-corrupted matrix) and also obtaining tight bounds for outlier identification in the noisy case.

[Figure 3 panels: (a) Complete Observation; (b) Partial Obs. (one run); (c) Partial Obs. (average).]

Fig. 3. This figure shows the $\ell_2$ norm of each of the 220 columns of $C$. Large norm indicates that the algorithm believes that column is an outlier. All "7's" and two "1's" are identified as outliers.

[Figure 4 panels: "1"; "7"; No 71; No 177.]

Fig. 4. This figure shows the typical "1's", the typical "7's" and also the two abnormal "1's" identified by the algorithm as outliers.

Appendix I
Orthogonal Case

This section investigates the special case where each outlier is orthogonal to the span of true samples, as stated in the following assumption.

Assumption 1: For $i \in \mathcal{I}_0$, $j \notin \mathcal{I}_0$, we have $M_i^T M_j = 0$.

In the orthogonal case, we are able to derive a necessary and sufficient condition for Outlier Pursuit to succeed. Such a condition is of course a necessary condition for Outlier Pursuit to succeed in the more general (non-orthogonal) case. Let $H_0$ be defined column-wise by

$$(H_0)_i = \begin{cases} \dfrac{(C_0)_i}{\|(C_0)_i\|_2} & \text{if } i \in \mathcal{I}_0; \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 6: Under Assumption 1, Outlier Pursuit succeeds if and only if

$$\|H_0\| \le 1/\lambda; \quad \|U_0 V_0^T\|_{\infty,2} \le \lambda. \tag{18}$$

If both inequalities hold strictly, then Outlier Pursuit strictly succeeds.
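Condition (18) says that $\lambda$ must lie in the interval $[\|U_0V_0^T\|_{\infty,2},\ 1/\|H_0\|]$. The sketch below computes the two endpoints for given $L_0$ and $C_0$; the function name and the numerical rank tolerance are assumptions made for illustration, and the code does not itself verify Assumption 1.

```python
import numpy as np

def lambda_interval_orthogonal(L0, C0):
    """Return (lo, hi) with lo = ||U0 V0^T||_{inf,2} and hi = 1 / ||H0||.

    L0 is the low-rank part (zero on outlier columns) and C0 the outlier part
    (zero elsewhere, with at least one nonzero column). Condition (18) asks
    for lo <= lambda <= hi.
    """
    U, s, Vt = np.linalg.svd(L0, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))              # numerical rank (tolerance is a choice)
    UVt = U[:, :r] @ Vt[:r, :]
    norms = np.linalg.norm(C0, axis=0)
    # H0: normalized outlier columns, zero columns elsewhere.
    H0 = np.divide(C0, norms, out=np.zeros_like(C0, dtype=float), where=norms > 0)
    lo = float(np.max(np.linalg.norm(UVt, axis=0)))  # largest column l2 norm
    hi = float(1.0 / np.linalg.norm(H0, 2))          # reciprocal spectral norm
    return lo, hi
```

With identical outlier columns, $\|H_0\|$ equals the square root of the number of outliers, which is the worst case discussed in the proof of Corollary 2.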

Corollary 2: If the outliers are generated adversarially, and Assumption 1 holds, then Outlier Pursuit succeeds (for some $\lambda^*$) if and only if

$$\frac{\gamma}{1-\gamma} \le \frac{1}{\mu r}.$$



Specifically, we can choose A* = 




A. Proof of Theorem 6

The proof consists of three steps. We first show that if Outlier Pursuit succeeds, then $(L_0, C_0)$ must be an optimal solution to Outlier Pursuit. Then, using the subgradient condition for optimal solutions of convex programs, we show that a necessary and sufficient condition for $(L_0, C_0)$ to be an optimal solution is the existence of a dual certificate $Q$. Finally, we show that the existence of $Q$ is equivalent to Condition (18) holding. We devote a subsection to each step.

1) Step 1: We need a technical lemma first.
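Lemma 12: Given $A \in \mathbb{R}^{m \times n}$, we have

$$\|\mathcal{P}_{\mathcal{I}_0^c}(A)\|_* \le \|A\|_*.$$

Proof: Fix $r \ge \mathrm{rank}(A)$. It is known that $\|A\|_*$ has the following variational form (Lemma 5.1 of [22]):

$$\begin{aligned}
\|A\|_* = \ \text{Minimize}_{X \in \mathbb{R}^{m \times r},\, Y \in \mathbb{R}^{n \times r}}:\quad & \tfrac{1}{2}\big(\|X\|_F^2 + \|Y\|_F^2\big) \\
\text{Subject to:}\quad & X Y^T = A.
\end{aligned} \tag{19}$$

Note that for any $X Y^T = A$, we have

$$X \bar Y^T = X\big(\mathcal{P}_{\mathcal{I}_0^c}(Y^T)\big) = \mathcal{P}_{\mathcal{I}_0^c}(A),$$

where $\bar Y$ is the matrix obtained by setting all rows of $Y$ indexed by $\mathcal{I}_0$ to zero. Thus, by the variational form of $\|\mathcal{P}_{\mathcal{I}_0^c}(A)\|_*$, and noting that $\mathrm{rank}(\mathcal{P}_{\mathcal{I}_0^c}(A)) \le r$, we have

$$\|\mathcal{P}_{\mathcal{I}_0^c}(A)\|_* \le \tfrac{1}{2}\big[\|X\|_F^2 + \|\bar Y\|_F^2\big] \le \tfrac{1}{2}\big[\|X\|_F^2 + \|Y\|_F^2\big].$$

Since this holds for any $X$, $Y$ such that $X Y^T = A$, the lemma follows from (19). ■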
Theorem 7: Under Assumption 1, for any $L'$, $C'$ such that $L' + C' = M$, $\mathcal{P}_{\mathcal{I}_0}(C') = C'$, and $\mathcal{P}_{U_0}(L') = L'$, we have

$$\|L_0\|_* + \lambda\|C_0\|_{1,2} \le \|L'\|_* + \lambda\|C'\|_{1,2},$$

with equality only when $L' = L_0$ and $C' = C_0$.

Proof: Write $L' = L_0 + \Delta$ and $C' = C_0 - \Delta$. Since $\mathcal{P}_{U_0}(L') = L'$, we have that for $i \in \mathcal{I}_0$, $\mathcal{P}_{U_0}(\Delta_i) = \Delta_i$, which implies that for $i \in \mathcal{I}_0$,

$$C_{0i}^T \Delta_i = (C_{0i}^T U_0) U_0^T \Delta_i = 0 \times U_0^T \Delta_i = 0,$$

where the last equalities follow from Assumption 1 and the definition of $C_0$ (recall that $C_{0i}$ is the $i$-th column of $C_0$). Thus, $\|C_0\|_{1,2} = \sum_{i \in \mathcal{I}_0} \|C_{0i}\|_2 \le \sum_{i \in \mathcal{I}_0} \|C_{0i} - \Delta_i\|_2 \le \sum_{i=1}^{n} \|C_{0i} - \Delta_i\|_2 = \|C'\|_{1,2}$, with equality only when $\Delta = 0$.

Further note that $\mathcal{P}_{\mathcal{I}_0}(C') = C'$ implies that $\mathcal{P}_{\mathcal{I}_0}(\Delta) = \Delta$, which by the definition of $L_0$ leads to

$$\mathcal{P}_{\mathcal{I}_0^c}(L') = L_0.$$

Thus, Lemma 12 implies $\|L_0\|_* \le \|L'\|_*$. The theorem thus follows. ■

Note that Theorem 7 essentially says that in the orthogonal case, if Outlier Pursuit succeeds, i.e., it outputs a pair $(L', C')$ such that $L'$ has the correct column space and $C'$ has the correct column support, then $(L_0, C_0)$ must be the output. This makes it possible to restrict our attention to investigating when the solution to Outlier Pursuit is $(L_0, C_0)$.
2) Step 2: 

Theorem 8: Under Assumption 1, $(L_0, C_0)$ is an optimal solution to Outlier Pursuit if and only if there exists $Q$ such that

(a) $\mathcal{P}_{T_0}(Q) = U_0 V_0^T$;

(b) $\|\mathcal{P}_{T_0^\perp}(Q)\| \le 1$;

(c) $\mathcal{P}_{\mathcal{I}_0}(Q) = \lambda H_0$;

(d) $\|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda$.

Here $\mathcal{P}_{T_0}(\cdot) \triangleq \mathcal{P}_{T(L_0)}(\cdot)$. In addition, if both inequalities are strict, then $(L_0, C_0)$ is the unique optimal solution.

Proof: Standard convex analysis yields that $(L_0, C_0)$ is an optimal solution to Outlier Pursuit if and only if there exists a dual matrix $Q$ such that

$$Q \in \partial\|L_0\|_*; \quad Q \in \partial\lambda\|C_0\|_{1,2}.$$

Note that a matrix $Q$ is a subgradient of $\|\cdot\|_*$ evaluated at $L_0$ if and only if it satisfies

$$\mathcal{P}_{T_0}(Q) = U_0 V_0^T; \quad \text{and} \quad \|\mathcal{P}_{T_0^\perp}(Q)\| \le 1.$$

Similarly, $Q$ is a subgradient of $\lambda\|\cdot\|_{1,2}$ evaluated at $C_0$ if and only if

$$\mathcal{P}_{\mathcal{I}_0}(Q) = \lambda H_0; \quad \text{and} \quad \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2} \le \lambda.$$

Thus, we conclude the proof of the first part of the theorem, i.e., the necessary and sufficient condition for $(L_0, C_0)$ to be an optimal solution.

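Next we show that if both inequalities are strict, then $(L_0, C_0)$ is the unique optimal solution. Fix $\Delta \ne 0$; we show that $(L_0 + \Delta, C_0 - \Delta)$ is strictly worse than $(L_0, C_0)$. Let $W$ be such that $\|W\| = 1$, $\langle W, \mathcal{P}_{T_0^\perp}(\Delta)\rangle = \|\mathcal{P}_{T_0^\perp}\Delta\|_*$, and $\mathcal{P}_{T_0} W = 0$. Let $F$ be such that

$$F_i = \begin{cases} -\dfrac{\Delta_i}{\|\Delta_i\|_2} & \text{if } i \notin \mathcal{I}_0 \text{ and } \Delta_i \ne 0; \\ 0 & \text{otherwise.} \end{cases}$$

Then $U_0 V_0^T + W$ is a subgradient of $\|\cdot\|_*$ at $L_0$ and $H_0 + F$ is a subgradient of $\|\cdot\|_{1,2}$ at $C_0$. Then we have

$$\begin{aligned}
\|L_0 + \Delta\|_* + \lambda\|C_0 - \Delta\|_{1,2}
&\ge \|L_0\|_* + \lambda\|C_0\|_{1,2} + \langle U_0 V_0^T + W, \Delta\rangle - \lambda\langle H_0 + F, \Delta\rangle \\
&= \|L_0\|_* + \lambda\|C_0\|_{1,2} + \|\mathcal{P}_{T_0^\perp}(\Delta)\|_* + \lambda\|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle U_0 V_0^T - \lambda H_0, \Delta\rangle \\
&= \|L_0\|_* + \lambda\|C_0\|_{1,2} + \|\mathcal{P}_{T_0^\perp}(\Delta)\|_* + \lambda\|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle Q - \mathcal{P}_{T_0^\perp}(Q) - (Q - \mathcal{P}_{\mathcal{I}_0^c}(Q)), \Delta\rangle \\
&= \|L_0\|_* + \lambda\|C_0\|_{1,2} + \|\mathcal{P}_{T_0^\perp}(\Delta)\|_* + \lambda\|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} + \langle -\mathcal{P}_{T_0^\perp}(Q), \Delta\rangle + \langle \mathcal{P}_{\mathcal{I}_0^c}(Q), \Delta\rangle \\
&\ge \|L_0\|_* + \lambda\|C_0\|_{1,2} + (1 - \|\mathcal{P}_{T_0^\perp}(Q)\|)\|\mathcal{P}_{T_0^\perp}(\Delta)\|_* + (\lambda - \|\mathcal{P}_{\mathcal{I}_0^c}(Q)\|_{\infty,2})\|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} \\
&\ge \|L_0\|_* + \lambda\|C_0\|_{1,2},
\end{aligned}$$

where the last inequality is strict unless

$$\|\mathcal{P}_{T_0^\perp}(\Delta)\|_* = \|\mathcal{P}_{\mathcal{I}_0^c}(\Delta)\|_{1,2} = 0. \tag{21}$$

We next show that Condition (21) also implies a strict increase of the objective function, which completes the proof. Note that Equation (21) is equivalent to $\Delta = \mathcal{P}_{T_0}(\Delta) = \mathcal{P}_{\mathcal{I}_0}(\Delta)$, and note that

$$\mathcal{P}_{U_0}(\Delta) = \mathcal{P}_{T_0}(\Delta) - \mathcal{P}_{V_0}(\Delta) + \mathcal{P}_{U_0}\mathcal{P}_{V_0}(\Delta) = \Delta - (I - \mathcal{P}_{U_0})\mathcal{P}_{V_0}\Delta.$$

Since $\mathcal{P}_{\mathcal{I}_0}(V_0^T) = 0$, $\mathcal{P}_{\mathcal{I}_0}(\Delta) = \Delta$ implies that $\mathcal{P}_{V_0}(\Delta) = 0$, which means

$$\Delta = \mathcal{P}_{U_0}(\Delta) = \mathcal{P}_{\mathcal{I}_0}(\Delta).$$

Thus, $\mathcal{P}_{U_0}(L_0 + \Delta) = L_0 + \Delta$, and $\mathcal{P}_{\mathcal{I}_0}(C_0 - \Delta) = C_0 - \Delta$. By Theorem 7, $\|L_0 + \Delta\|_* + \lambda\|C_0 - \Delta\|_{1,2} > \|L_0\|_* + \lambda\|C_0\|_{1,2}$, which completes the proof. ■

3) Step 3: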

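Theorem 9: Under Assumption 1, if there exists any matrix $Q$ that satisfies the conditions of Theorem 8, then $U_0 V_0^T + \lambda H_0$ satisfies them as well.

Proof: Denote $Q_0 \triangleq U_0 V_0^T + \lambda H_0$. We first show that the two equalities among these conditions hold. Note that

$$\mathcal{P}_{T_0}(Q_0) = \mathcal{P}_{T_0}(U_0 V_0^T) + \lambda\mathcal{P}_{T_0}(H_0) = U_0 V_0^T + \lambda[\mathcal{P}_{U_0}(H_0) + \mathcal{P}_{V_0}(H_0) - \mathcal{P}_{U_0}\mathcal{P}_{V_0}(H_0)].$$

Further note that $\mathcal{P}_{U_0}(H_0) = U_0(U_0^T H_0) = 0$ due to Assumption 1, and $\mathcal{P}_{V_0}(H_0) = 0$ because $\mathcal{P}_{\mathcal{I}_0} H_0 = H_0$ and $\mathcal{P}_{\mathcal{I}_0}(V_0^T) = 0$ lead to $H_0 V_0 = 0$. Hence

$$\mathcal{P}_{T_0}(Q_0) = U_0 V_0^T.$$

Furthermore,

$$\mathcal{P}_{\mathcal{I}_0}(Q_0) = \mathcal{P}_{\mathcal{I}_0}(U_0 V_0^T) + \lambda\mathcal{P}_{\mathcal{I}_0}(H_0) = U_0 \mathcal{P}_{\mathcal{I}_0}(V_0^T) + \lambda H_0 = \lambda H_0.$$

Here, the last equality holds because $\mathcal{P}_{\mathcal{I}_0}(V_0^T) = 0$. Note that this also implies that

$$\mathcal{P}_{T_0^\perp}(H_0) = H_0; \quad \mathcal{P}_{\mathcal{I}_0^c}(U_0 V_0^T) = U_0 V_0^T. \tag{22}$$

Now consider any matrix $Q$ that also satisfies the two equalities. Let $Q = U_0 V_0^T + \lambda H_0 + \Delta$, and note that $Q$ satisfies $\mathcal{P}_{\mathcal{I}_0}(Q) = \lambda H_0$ and $\mathcal{P}_{T_0}(Q) = U_0 V_0^T$, which leads to

$$\mathcal{P}_{\mathcal{I}_0}(\Delta) = 0; \quad \text{and} \quad \mathcal{P}_{T_0}(\Delta) = 0.$$

Thus,

$$\mathcal{P}_{\mathcal{I}_0^c}(Q) = U_0 V_0^T + \Delta; \quad \text{and} \quad \mathcal{P}_{T_0^\perp}(Q) = \lambda H_0 + \Delta.$$

Note that

$$\|U_0 V_0^T + \Delta\|_{\infty,2} = \max_i \|U_0 (V_0^T)_i + \Delta_i\|_2 \ge \max_i \|U_0 (V_0^T)_i\|_2 = \|U_0 V_0^T\|_{\infty,2}.$$

Here, the inequality holds because $\mathcal{P}_{T_0}(\Delta) = 0$ implies that the columns $\Delta_i$ are orthogonal to the span of $U_0$. Note that the inequality is strict when $\Delta \ne 0$.

On the other hand,

$$\begin{aligned}
\|\lambda H_0\| &= \max_{\|x\| \le 1,\ \|y\| \le 1} x^T(\lambda H_0) y \overset{(a)}{=} \max_{\|x\| \le 1,\ \|y\| \le 1,\ \mathcal{P}_{\mathcal{I}_0^c}(y^T) = 0} x^T(\lambda H_0) y \\
&\overset{(b)}{=} \max_{\|x\| \le 1,\ \|y\| \le 1,\ \mathcal{P}_{\mathcal{I}_0^c}(y^T) = 0} x^T(\lambda H_0 + \Delta) y \le \max_{\|x\| \le 1,\ \|y\| \le 1} x^T(\lambda H_0 + \Delta) y = \|\lambda H_0 + \Delta\|.
\end{aligned}$$

Here, (a) holds because $\mathcal{P}_{\mathcal{I}_0} H_0 = H_0$, thus for any $y$, setting all $y_i = 0$ for $i \notin \mathcal{I}_0$ does not change $x^T(\lambda H_0) y$; while (b) holds since $\mathcal{P}_{\mathcal{I}_0^c}\Delta = \Delta$.

Thus, if $Q$ satisfies the two inequalities, then so does $Q_0$, which completes the proof. ■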

Note that by Equation (22) we have

$$\mathcal{P}_{T_0^\perp}(H_0) = H_0; \quad \mathcal{P}_{\mathcal{I}_0^c}(U_0 V_0^T) = U_0 V_0^T.$$

Thus, Theorem 7, Theorem 8 and Theorem 9 together establish Theorem 6.

B. Proof of Corollary 2

Corollary 2 holds due to the following lemma that tightly bounds $\|H_0\|$ and $\|U_0 V_0^T\|_{\infty,2}$.

Lemma 13: We have (I) $\|H_0\| \le \sqrt{\gamma n}$, and the inequality is tight. (II) $\|U_0 V_0^T\|_{\infty,2} = \max_i \|V_0^T e_i\|_2 = \sqrt{\frac{\mu r}{(1-\gamma)n}}$.

Proof: Following the variational form of the operator norm, we have

$$\|H_0\| = \max_{\|x\|_2 \le 1,\ \|y\|_2 \le 1} x^T H_0 y = \max_{\|x\|_2 \le 1} \|x^T H_0\|_2 = \max_{\|x\|_2 \le 1} \sqrt{\sum_{i=1}^{n} \big(x^T (H_0)_i\big)^2} \le \sqrt{|\mathcal{I}_0|} = \sqrt{\gamma n}.$$

The inequality holds because $\|(H_0)_i\|_2 = 1$ when $i \in \mathcal{I}_0$, and equals zero otherwise. Note that if we let the columns $(H_0)_i$, $i \in \mathcal{I}_0$, all be the same, such as by taking identical outliers, the inequality is tight.

By definition we have $\|U_0 V_0^T\|_{\infty,2} = \max_i \|U_0 (V_0^T)_i\|_2 \overset{(a)}{=} \max_i \|(V_0^T)_i\|_2 = \max_i \|V_0^T e_i\|_2$. Here (a) holds since $U_0$ is orthonormal. The second claim hence follows from the definition of $\mu$. ■